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ABSTRACT  OF  THE  DISSERTATION 


Variational  and  PDE-based  methods 
for  big  data  analysis,  classification  and  image 
processing  using  graphs 

by 

Ekaterina  Merkurjev 

Doctor  of  Philosophy  in  Mathematics 
University  of  California,  Los  Angeles,  2015 
Professor  Andrea  Bertozzi,  Chair 


We  present  several  graph-based  algorithms  for  image  processing  and  classification  of  high¬ 
dimensional  data.  The  first  (semi-supervised)  method  uses  a  graph  adaptation  of  the  classical 
numerical  Merriman-Bence-Osher  (MBO)  scheme,  and  can  be  extended  to  the  multiclass  case  via 
the  Gibbs  simplex.  We  show  examples  of  the  application  of  the  algorithm  in  the  areas  of  image 
inpainting  (both  binary  and  grayscale),  image  segmentation  and  classification  on  benchmark  data 
sets.  We  have  also  applied  this  algorithm  to  the  problem  of  object  detection  using  hyperspectral 
video  sequences  as  a  data  set.  In  addition,  a  second  related  model  is  introduced.  It  uses  a  diffuse 
interface  model  based  on  the  Ginzburg-Landau  functional,  related  to  total  variation  compressed 
sensing  and  image  processing.  A  multiclass  extension  is  introduced  using  the  Gibbs  simplex, 
with  the  functional’s  double-well  potential  modified  to  handle  the  multiclass  case.  The  version 
minimizes  the  functional  using  a  convex  splitting  numerical  scheme.  In  our  computations,  we  make 
use  of  fast  numerical  solvers  for  finding  the  eigenvectors  and  eigenvalues  of  the  graph  Laplacian, 
and  take  advantage  of  the  sparsity  of  the  matrix.  Experiments  on  benchmark  data  sets  show  that 
our  models  produce  results  that  are  comparable  with  or  outperform  the  state-of-the-art  algorithms. 

The  second  (semi- supervised)  method  develops  a  global  minimization  framework  for  binary 
classification  of  high-dimensional  data.  It  combines  recent  convex  optimization  methods  for  im- 
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age  processing  with  recent  graph  based  variational  models  for  data  segmentation.  Two  convex 
splitting  algorithms  are  proposed,  where  graph-based  PDE  techniques  are  used  to  solve  some  of 
the  subproblems.  It  is  shown  that  global  minimizers  can  be  guaranteed  for  semi- supervised  seg¬ 
mentation  with  two  regions.  If  constraints  on  the  volume  of  the  regions  are  incorporated,  global 
minimizers  cannot  be  guaranteed,  but  can  often  be  obtained  in  practice  and  otherwise  be  closely 
approximated.  We  perform  a  thorough  comparison  to  recent  MBO  (Merriman-Bence-Osher)  [81] 
and  phase  field  methods,  and  show  the  advantage  of  the  proposed  algorithms. 

Lastly,  we  present  the  current  work  (unsupervised  method)  related  to  normalized  cuts.  The 
method  uses  a  clever  alternative  to  the  normalized  cut  to  solve  the  binary  classification  problem.  In 
particular,  we  work  with  the  Ginzburg-Landau  functional.  In  addition,  we  use  a  generalized  graph¬ 
ical  framework,  so  several  different  graph  Laplacians  are  tested  and  their  results  are  compared. 
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CHAPTER  1 


Introduction 


We  present  several  methods,  outlined  in  Chapters  3-5,  for  image  processing  and  data  classifica¬ 
tion  using  a  graphical  framework.  The  framework  is  often  used  to  exploit  underlying  similarities 
in  the  data  [6,22, 106, 113-115].  For  example,  spectral  graph  theory  [24,84]  uses  this  approach 
to  perform  various  tasks  in  imaging  and  data  clustering.  Graph-based  formulations  have  been  also 
used  extensively  for  image  processing  applications  [10, 26, 27, 36, 56, 57, 59, 74, 96].  Specifically, 
algorithms  for  image  denoising  in  [  1 7] ,  image  inpainting  and  reconstruction  in  [5 1 , 9 1 , 1 1 1  ] ,  image 
deblurring  in  [77]  and  manifold  processing  in  [36]  all  utilize  such  formulations.  We  use  a  non¬ 
local  calculus  formulation  [103]  to  generalize  the  continuous  formulation  to  a  (non-local)  discrete 
setting,  while  other  non-local  versions  for  weighted  graphs  are  described  in  [36].  A  comprehensive 
reference  about  casting  continuous  PDEs  in  graph  form  is  found  in  [58]. 

Chapter  3  develops  a  fast  algorithm  (MBO  method)  for  classification  and  image  processing. 
The  method  is  inspired  by  diffuse  interface  models  that  have  been  used  in  a  variety  of  problems, 
such  as  those  in  fluid  dynamics  and  materials  science.  As  an  alternative  to  L 1  compressed  sens¬ 
ing  methods,  Bertozzi  and  Flenner  introduce  a  graph-based  model  based  on  the  Ginzburg-Landau 
functional  in  their  work  [9] .  To  define  the  functional  on  a  graph,  the  spatial  gradient  is  replaced 
by  a  more  general  graph  gradient  operator.  Analogous  to  the  continuous  case,  the  first  variation 
of  the  model  yields  a  gradient  descent  equation  with  the  graph  Laplacian,  which  is  then  solved  by 
a  numerical  scheme  with  convex  splitting.  To  reduce  the  dimension  of  the  graph  Laplacian  and 
make  the  computation  more  efficient,  the  authors  propose  the  Nystrom  extension  method  [44]  to 
approximate  eigenvalues  and  the  corresponding  eigenvectors  of  the  graph  Laplacian.  Moreover, 
many  applications  suggest  that  the  MBO  scheme  of  Merriman,  Bence  and  Osher  [82]  for  approx¬ 
imating  the  motion  by  mean  curvature  performs  very  well  in  minimizing  functionals  built  around 
the  Ginzburg-Landau  functional.  For  example,  the  authors  of  [39]  propose  an  adaptation  of  the 
scheme  to  solve  the  piecewise  constant  Mumford-Shah  functional.  This  inspired  us  to  adapt  the 
MBO  scheme  [82]  for  solving  graph  based  equations  to  create  a  simple  algorithm  that  achieves 
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faster  convergence  through  a  small  number  of  computationally  inexpensive  iterations.  The  result¬ 
ing  MBO  method  can  be  applied  to  various  problems,  such  as  image  processing,  classification  and 
object  detection  in,  for  example,  hyperspectral  data. 

We  proceed  with  the  chapter  by  presenting  two  graph-based  algorithms  for  multiclass  clas¬ 
sification  of  high  dimensional  data.  The  multiclass  extension  is  obtained  using  the  Gibbs  sim¬ 
plex.  The  first  algorithm  minimizes  the  Ginzburg-Landau  functional  using  gradient  descent  and  a 
convex-splitting  scheme.  The  second  algorithm  is  an  extension  of  the  MBO  method  already  intro¬ 
duced.  It  uses  fewer  parameters  than  the  first  algorithm,  and  while  this  may  in  some  cases  make  it 
more  restrictive,  in  our  experiments  it  was  highly  accurate  and  efficient.  Both  of  these  algorithms 
demonstrate  how  methods  motivated  by  the  PDE  literature  can  be  productively  adapted  to  graphs, 
producing  effective  multiclass  data  segmentation  methods. 

The  theoretical  portion  of  the  chapter  is  concluded  with  a  presentation  of  an  application  of  the 
multiclass  MBO  algorithm  to  hyperspectral  video  data.  We  use  the  Nystrom  extension  method 
to  efficiently  calculate  the  needed  eigenvectors.  This  implementation  of  the  algorithm  requires 
an  operator  assisted  spectral  clustering  preprocessing  step  to  identify  a  subset  of  pixels  denoted  as 
“ground  truth”  for  the  four  classes.  The  resulting  classification  of  chemical  plumes  and  background 
pixels  are  excellent.  Only  a  small  number  of  eigenvectors  is  needed  to  achieve  a  good  result  and 
no  preprocessing  is  necessary. 

Chapter  4  develops  a  global  minimization  framework  for  segmentation  of  high  dimensional 
data  into  two  classes.  Instead  of  applying  classical  combinatorial  algorithms,  we  build  on  more 
recent  work  from  imaging,  which  formulates  two  class  partition  problems  as  convex  variational 
problems  [11,21,54]  or  variational  min-cut/max-flow  problems  [108, 109].  Convex  optimization 
algorithms  were  used  in  [11,54, 108, 109]  to  split  the  problems  into  simpler  subproblems,  each 
of  which  could  be  solved  by  PDE  techniques.  In  this  chapter,  we  describe  the  extension  of  the 
variational  min-cut/max-flow  duality  in  [108,  109]  and  of  the  algorithm  in  [54,  107]  to  a  more 
general  graph  setting  to  solve  a  more  general  clustering  problem.  The  new  subproblems  are  solved 
by  graph-based  PDE  techniques.  We  also  show  how  constraints  on  the  size  of  each  class  can  be 
incorporated  by  a  small  modification  of  the  max-flow  algorithm.  The  advantage  of  the  methods 
proposed  in  this  chapter  is  the  fact  that  they  are  convex,  unlike  those  in  the  earlier  chapter,  which 
have  the  potential  of  occasionally  being  stuck  in  a  local  minima. 

While  Chapters  3  and  4  involve  semi- supervised  algorithms,  Chapter  5  develops  an  unsuper- 
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vised  method,  where  there  is  no  a  priori  knowledge  of  the  labeling  of  some  of  the  data  points. 
This  is  a  harder  problem,  but  it  has  the  advantage  of  not  requiring  one  to  know  part  of  the  ground 
truth.  The  goal  of  the  chapter  is  binary  classification,  same  as  in  Chapter  4.  The  novel  binary 
unsupervised  clustering  algorithm  introduced  is  a  modification  of  the  normalized  cut  problem. 

One  of  the  ways  to  cluster  a  target  set  X  (defined  on  a  graph  G  =  (V.  E )  and  the  weight 
function  w  defined  on  the  set  of  edges)  into  two  clusters  is  to  find  a  partition  X  =  S  U  S  such  that 
the  following  value  is  minimized: 

cut(S,  S)  —  Y  w(x>y)- 

x£S,y£S 

The  intuition  behind  this  is  the  fact  that  we  want  to  find  a  partition  such  that  the  weights  between 
vertices  of  the  same  set  are  large,  and  weights  between  vertices  of  different  sets  are  small.  In  other 
words,  we  want  to  group  vertices  that  are  alike  together,  and  put  those  that  are  dissimilar  in  different 
groups.  However,  minimizing  the  above  problem  usually  leads  to  an  undesirable  solution,  because 
it  tends  to  isolate  individual  vertices  from  the  rest  of  the  graph.  Usually,  what  is  wanted  is  for  the 
two  sets  to  be  relatively  close  in  size.  Thus,  some  sort  of  normalization  is  usually  needed.  Our 
method  is  developed  from  the  normalized  cut  problem  and  uses  the  Ginzburg-Landau  functional. 

This  problem  has  been  a  popular  one  to  be  studied.  Here,  we  emphasize  work  that  at  least  in¬ 
directly  deals  with  solving  the  cut  problem  (with  some  normalization  factor).  In  [18],  the  authors 
present  a  generalized  version  of  spectral  clustering  using  the  graph  p-Laplacian.  They  show  that,  in 
a  certain  limit,  the  cut  resulting  from  thresholding  the  second  eigenvector  of  the  graph  p-Laplacian 
is  the  solution  to  the  Cheeger  cut  problem.  An  efficient  scheme  to  calculate  the  eigenvector  is 
introduced.  In  [63],  Hein  et  al.  show  that  some  constrained  optimization  problems  can  be  formu¬ 
lated  as  nonlinear  eigenproblems.  The  authors  then  describe  a  generalization  of  the  inverse  power 
method  which  converges  to  nonlinear  eigenvectors.  This  method  is  applied  to  spectral  clustering 
and  sparse  principal  component  analysis.  Recent  work  by  Bresson,  Szlam,  Laurent,  von  Brecht,  et 
al.  includes  several  important  papers  related  to  this  area.  In  [100],  Szlam  et  al.  give  a  continuous 
relaxation  of  the  Cheeger  cut  problem  on  a  weighted  graph.  In  particular,  they  show  the  equiva¬ 
lence/relationship  between  the  total  variation  problem  and  the  Cheeger  cut  problem.  An  algorithm 
based  on  the  split-Bregman  method  [55]  is  developed  to  minimize  the  proposed  energy.  Authors 
of  [12]  present  two  algorithms  solving  the  relaxed  Cheeger  cut  problem.  They  also  prove  conver¬ 
gence  results  for  these  algorithms.  The  first  algorithm  is  a  novel  steepest  descent  approach  and 
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the  second  one  is  a  modified  inverse  power  method.  In  [14],  Bresson  et  al.  describe  an  adaptive 
version  of  the  method  shown  in  [12],  the  goal  of  which  is  to  compute  the  solution  of  the  relaxed 
Cheeger  cut  problem.  This  is  achieved  via  a  new  adaptive  stopping  condition.  The  result  is  an  algo¬ 
rithm  that  is  monotonic  and  much  more  efficient  than  before.  Multiclass  extensions  have  also  been 
proposed.  One  approach  is  to  use  recursive  extensions  and  thus  solve  a  collection  of  binary  prob¬ 
lems.  However,  other  approaches  have  been  introduced.  The  authors  of  [13]  present  a  framework 
for  multi  class  total  variation  clustering  that  does  not  use  recursion.  They  formulate  the  Cheeger 
energy  in  a  multi  class  setting,  and  the  relax  the  energy  in  a  continuous  setting.  This  results  in  an 
optimization  problem  involving  total  variation,  which  is  then  solved  using  the  proposed  proximal 
splitting  algorithm.  In  [15],  an  extension  of  the  result  of  [100]  is  introduced.  The  extension  deals 
with  multiple  classes  and  learning  from  these  classes  using  a  set  of  labels.  The  method  is  made 
even  more  efficient  by  the  usage  of  fast  L 1  solvers,  designed  for  the  total  variation  semi-norm. 
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CHAPTER  2 


Background 


2.1  Graphical  Framework 

For  all  methods,  we  consider  a  graphical  framework  with  an  undirected  graph  G  =  (V,  E), 
where  V  is  the  set  of  vertices  and  E  is  the  set  of  edges.  Let  w  be  the  weight  function  (defined  on 
the  set  of  edges)  which  measures  the  similarity  between  each  two  vertices.  Also,  let 

d(x)  =  ^2w(x,y) 

y&V 

be  the  degree  of  vertex  x.  The  diagonal  matrix  D  contains  the  degree  along  its  diagonal  entries 
and  the  matrix  W  contains  values  of  the  weight  function.  Both  matrices  are  of  dimension  n  by  n, 
where  n  is  the  number  of  vertices.  A  representation  of  the  graphical  framework  is  shown  below. 


One  advantage  of  using  a  graphical  framework  is  that  it  allows  one  to  be  non-local  and  take 
into  account  the  relationship  between  any  two  nodes  in  the  data  set.  Therefore,  repetitive  structure 
and  texture  can  be  captured.  The  graphical  framework  is  also  more  general,  and  can  be  easily 
constructed  for  any  data  set. 
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Weight  Function 

When  choosing  a  weight  function,  the  goal  is  to  give  a  large  weight  to  an  edge  if  the  two  vertices 
it  is  connecting  are  similar  and  a  small  weight  if  they  are  dissimilar.  One  popular  choice  for  the 
weight  function  is  the  Gaussian 

d(x,y 

w(x,y)  =  e  ,  (1) 

where  d(x,  y )  is  some  distance  measure  between  the  two  vertices  x  and  y,  and  a  is  a  parameter 
to  be  chosen.  For  example,  if  the  data  set  consists  of  points  in  M2,  d{x,  y)  can  be  the  Euclidean 
distance  between  point  x  and  point  y,  since  points  farther  away  are  less  likely  to  belong  to  the  same 
cluster  than  points  closer  together.  For  images,  d(x,  y)  can  be  defined  as  the  weighted  2-norm  of 
the  difference  of  the  feature  vectors  of  pixels  x  and  y,  where  the  feature  vector  of  a  node  consists 
of  intensity  values  of  pixels  in  its  neighborhood,  as  described  in  [52]. 

Another  choice  for  the  similarity  function  used  in  this  work  is  the  Zelnik-Manor  and  Perona 
weight  function  [90]  for  sparse  matrices: 

_  d(x,y)2 

w(x,  y)  —  e  Vt(*)t(v)  5  (2) 

where  the  local  parameter  t{x)  =  d(x,  z)2,  and  z  is  the  Mth  closest  vertex  to  vertex  x. 

Note  that  it  is  not  necessary  to  use  a  fully  connected  graph  setting,  which  might  be  a  compu¬ 
tational  burden.  Specifically,  the  fully  connected  graph  can  be  approximated  by  a  much  smaller 
graph  by  only  including  an  edge  between  vertex  x  and  y  if  x  is  a  A  - nearest  neighbor  of  y  or  vice 
versa.  This  is  called  a  k-nearest  neighbor  graph.  One  can  also  create  a  mutual  k-nearest  neighbor 
graph  by  only  including  an  edge  between  x  and  y  if  both  of  them  are  A  -ncarcst  neighbors  of  each 
other.  In  this  paper,  we  make  use  of  such  an  approximation;  our  edge  set  includes  only  edges 
between  vertices  that  are  near  to  each  other.  If  two  vertices  x  and  y  are  not  connected  by  an  edge, 
then  the  weight  between  them  is  set  to  0. 

Graph  Laplacian 

In  the  graphical  framework,  it  is  possible  to  introduce  some  common  mathematical  operators 
in  a  graphical  setting.  For  this  section,  we  will  only  be  concerned  with  the  graph  version  of 
the  differential  Faplace  operator.  Although  many  versions  exist,  we  mention  the  following  three 
matrices  that  are  related  to  the  differential  A  operator: 
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*  L  =  D  —  W,  unnormalized  Laplacian 


*  Ls  =  D~2LD“2,  symmetric  Laplacian 

*  Lrw  =  D  2L,  random  walk  Laplacian 

The  last  two  matrices  represent  normalized  versions  of  the  original  Laplacian,  as  it  is  sometimes 
desirable  to  scale,  especially  in  high  dimensions.  Note  that  we  have  the  following  equations: 


L u(x)  =  ^  w{x,y){u{x )  ~u(y)), 


L  su(x)  =  ~^—^2w(x,y)(u(x)  —  u(y)), 

^  y 

T  .  .  1  ,  x  /  u(x)  u(y)  , 

L~»H  =  -7=  £«>(*,  !/)(-;=  -  ~7Ts)- 


vA 


Vd(x)  Vd(y)' 


The  graph  Laplacian  L  has  the  following  easily  shown  properties: 

1)  L  is  symmetric  and  positive  semi-definite. 

2)  L  has  n  non-negative,  real-valued  eigenvalues  0  =  Ai  <  A2  <  Xn. 

3)  The  smallest  eigenvalue  of  L  is  0;  eigenvector  is  just  a  constant  vector. 

The  graph  Laplacians  Ls  and  Lrw  have  the  following  easily  shown  properties: 

1)  Ls  and  Lrw  are  positive  semi-definite. 

2)  Ls  and  Lrw  have  n  non-negative,  real-valued  eigenvalues  0  =  Ai  <  A2  <  An. 

3)  A  is  an  eigenvalue  of  Lrw  with  eigenvector  u  if  and  only  if  A  is  an  eigenvalue  of  Ls  with 
eigenvector  w  =  D^u. 

4)  The  smallest  eigenvalue  of  Ls  and  Lrw  is  0. 

It  is  also  worthwhile  to  mention  that  the  multiplicity  of  eigenvalue  0  equals  the  number  of 
connected  components  in  the  graph. 

Other  Graph  Operators 

Another  important  operator  that  arises  from  the  need  to  define  variational  methods  on  graphs 
is  the  mean  curvature  on  graphs.  This  non-local  operator  was  introduced  by  Osher  and  Shen 
in  [89],  who  defined  it  via  graph  based  p-Laplacian  operators.  /;- Lap  lace  operators  are  a  family  of 
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quasilinear  elliptic  partial  differential  operators  defined  for  1  <  p  <  oo: 

Lp{u)  =  V  ■  (|  V«  |p"2  Vu) 

In  the  special  case  p  —  2,  the  p-Laplacian  is  just  a  regular  Laplacian.  For  p  —  1,  the  p-Laplacian 
represents  curvature. 

The  discrete  graph  version  of  p-Laplace  operators  is  defined  in  [36]  as: 

L p(u(x))  =  ~  w(x,?/)(||Vm(x)||p_2+  \\Vu(y)\\p-2)(u(x)  -u(y)). 

^  (x,y)£E 

Note  that  the  graph  2-Laplacian  is  just  the  graph  Laplacian,  which  is  consistent  with  the  continuous 
case. 

Let  us  now  define  the  mean  curvature  on  graphs-  the  discrete  analog  of  the  mean  curvature  of 
the  level  curve  of  a  function  defined  on  a  continuous  domain  of  M;V: 

= l  (  '"<*-»)  (jjv^bi + - uiv))- 

Note  that  in  the  case  of  unweighted  mesh  graph,  nw  becomes  a  numerical  discretization  of  mean 
curvature.  This  curvature,  kw,  is  also  used  in  [30]  as  a  regularizer  in  a  graph  adaptation  of  the  Chan- 
Vese  method.  In  their  work  in  progress  [104],  Van  Gennip  et  al.  propose  a  different  definition  of 
mean  curvature  on  graphs  and  prove  convergence  of  the  MBO  scheme  on  graphs. 

Previous  Work  Using  Graphs 

Graph-based  formulations  have  been  used  extensively  for  image  processing  applications  [10, 
26,  27, 36,  56,  57, 59, 74].  Interesting  connections  between  these  different  algorithms,  as  well  as 
between  continuous  and  discrete  optimizations,  have  been  established  in  the  literature.  Grady  has 
proposed  a  random  walk  algorithm  [57]  that  performs  interactive  image  segmentation  using  the 
solution  to  a  combinatorial  Dirichlet  problem.  Elmoataz  et  al.  have  developed  generalizations  of 
the  graph  Laplacian  [36]  for  image  denoising  and  manifold  smoothing. 

Couprie  et  al.  in  [26]  define  a  conveniently  parameterized  graph-based  energy  function  that  is 
able  to  unify  graph  cuts,  random  walker,  shortest  paths  and  watershed  optimizations.  There,  the 
authors  test  different  seeded  image  segmentation  algorithms,  and  discuss  possibilities  to  optimize 
more  general  models  with  applications  beyond  image  segmentation.  In  [27],  alternative  graph- 
based  formulations  of  the  continuous  max-flow  problem  are  compared,  and  it  is  shown  that  not 


all  the  properties  satisfied  in  the  continuous  setting  carry  over  to  the  discrete  graph  representation. 
For  general  data  segmentation,  Bresson  et  al.  in  [12],  present  rigorous  convergence  results  for 
two  algorithms  that  solve  the  relaxed  Cheeger  cut  minimization,  and  show  a  formula  that  gives  the 
correspondence  between  the  global  minimizers  of  the  relaxed  problem  and  the  global  minimizers 
of  the  combinatorial  problem. 

2.2  Graphical  Framework,  Extended 

For  the  MBO  method,  we  only  need  to  define  the  Laplace  operator  in  a  more  general  graphical 
framework,  since  this  is  the  only  operator  encountered  in  the  procedure.  For  each  of  the  versions 
of  the  method  developed  in  Chapter  4,  however,  we  need  to  consider  more  operators,  and  thus 
we  outline  the  graphical  framework  in  more  detail  here,  giving  more  general  definitions  for  other 
operators.  We  define  operators  on  graphs  in  a  similar  fashion  as  done  in  [62,  103],  where  the 
justification  for  these  choices  is  shown. 

^  ^  m(m—  1) 

Assume  m  is  the  number  of  vertices  in  the  graph  and  let  V  =  Wn  and  S  =  M  2  be  Hilbert 
spaces  (associated  with  the  set  of  vertices  and  edges,  respectively)  defined  via  the  following  inner 
products: 

(u,l)v  =  ^u(x)~f(x)d(x)r, 

X 

|  _ 

($,<!>)  e  =  2  ^2'P(x,y)(j)(x,y)w(x,yj2q~1 
x,y 

for  some  r  G  [0, 1]  and  q  G  [|,  1].  Let  us  also  define  the  following  norms: 

MWe  =  \ZWWs  =  *  fc'52<l>(x,y)2u>(x,y)2q~1, 

V  X>V 

IHkoc  =  max  \  4>(x,  y)\. 
x,y 

The  gradient  operator  V  :  V  — >  £  is  then  defined  as: 

(Vu)w(x,  y )  =  w(x,  y)1~q(u(y)  -  «(x)).  (3) 

The  Dirichlet  energy  does  not  depend  on  r  or  q: 

x,y 
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(4) 


The  divergence  div  :  £  — >  V  is  defined  as  the  adjoint  of  the  gradient: 

(div«)  <t>)(x)  =  2(j^xy  VYW1’  y )  “  0(s/» x))> 

where  we  define  the  adjoint  using  the  following  definition:  (Vu,  (l))e  =  —{u,  divu,  0)y. 

We  now  have  a  family  of  graph  Laplacians  Ar  =  divu,  V  :  V  — >  V: 

(aw-u)(x)  =  ^  (u(y)  -  u(x ))•  (5) 

Viewing  u  as  a  vector  in  Mm,  we  can  write 

-A wu  =  (D1_r  -  DrW )u. 

The  case  with  r  —  0  is  the  unnormalized  Laplacian 

L  =  D  -  W. 

However,  the  matrix  L  is  usually  scaled  to  guarantee  convergence  to  the  continuum  differential 
operator  in  the  limit  of  large  sample  size  [9] .  Although  several  versions  exist,  we  consider  two 
popular  versions  of  the  symmetric  Laplacian 

Ls  =  D  ^LD  ^  =  I  —  D  ^WD-^ 

and  the  random  walk  Laplacian  (r  =  1) 

Lrw  =  D-'L  =  I  -  D  XW. 

The  advantage  of  the  former  formulation  is  its  symmetric  property  which  allows  for  more  efficient 
implementations . 

A  family  of  anisotropic  total  variations  TVW  :  V  — »  R.  can  now  be  defined: 

TVw{u)  =  max  {(divw  0,  u)v  :  0  e  £,  ||0||£>oo  <  1}  =  ^  w(x,  y)q\u(x)  -  u(y) |.  (6) 

x.y 

Lastly,  in  this  section,  we  consider  the  following  graph-based  Ginzburg  Landau  functional: 

GL,M  =  ||Vu||!  +  i£H'(«(a:)). 

X 

Remark.  It  is  noted  in  [103]  that  although  the  first  term  in  the  continuous  Ginzburg -Landau 
functional 

e  J \Vu\2dx  +  -  J  W{u)dx 


10 


is  scaled  by  e,  the  first  term  of  GLe  contains  no  e.  This  occurs  because  the  Dirichlet  energy  in 
the  continuous  Ginzburg-Landau  functional  is  unbounded  for  functions  of  bounded  variation  and 
taking  on  two  values  of  the  minima  of  the  double-well  potential  (almost  everywhere ).  However,  the 
difference  terms  ofGLe  are  finite  even  in  the  case  of  binary  functions,  and  no  rescaling  of  the  first 
term  is  necessary. 

It  remains  to  choose  the  parameters  q  and  r.  We  choose  q  =  1  as  in  [103],  where  it  is  shown 
that  for  any  r,  TVW  is  the  I  - limit  (Gamma  convergence)  of  a  sequence  of  graph-based  Ginzburg- 
Landau  (GL)-type  functionals: 

P 

Theorem  1.  GLe  — >  GLq  as  e  — >  0,  where 

GLf(u)  =  ||  V«|||  +  -  W(u(x))  =  \  £  w(x,  y)(u(x)  -  u(y))2  +  -  £  W(  u(x)), 

6  Z  6 

x  x,y  x 

{TVw[u)with  q=l  for  u  s.t.  u(x )  G  {0, 1} 
oo  otherwise 

Proof  See  Theorem  3.1  of  [103].  ■ 

It  is  also  shown  in  the  paper  (specifically  Theorem  3.6)  that  the  addition  of  a  fidelity  term  is 
compatible  with  T -convergence .  Since  one  of  the  algorithms  we  compare  our  methods  to  involves 
the  Ginzburg-Landau  functional,  to  be  consistent,  we  use  the  above  definitions  with  q  —  1  in  our 
formulations. 

We  choose  r  —  1  because  it  results  in  a  normalized  random  walk  Laplacian  and  the  eigenvec¬ 
tors  as  well  as  the  corresponding  eigenvalues  of  the  matrix  can  be  efficiently  calculated.  Although 
the  random  walk  Laplacian  matrix  itself  is  not  symmetric,  spectral  graph  theory  described  in  [24] 
shows  that  the  eigenvectors  of  the  random  walk  Laplacian  can  be  directly  computed  from  knowing 
the  diagonal  matrix  D  and  the  eigenvectors  of  the  symmetric  graph  Laplacian  (which  is  a  sym¬ 
metric  matrix)  Ls.  In  particular,  A  is  an  eigenvalue  of  Lrw  with  eigenvector  u  if  and  only  if  A  is 
an  eigenvalue  of  Ls  with  eigenvector  w  =  D  2  u.  This  is  proved  by  multiplying  the  eigenvalue 
equation  Lrw u  =  Xu  by  D  ^  from  the  left  and  then  substituting  w  =  D  2  u,  obtaining  L  sw  =  A vj. 

We  take  advantage  of  this  property  by  calculating  the  eigenvalues  and  eigenvectors  of  the  sym¬ 
metric  graph  Laplacian  (since  symmetric  matrices  allow  for  more  efficient  implementations)  and 
then  using  this  information  to  calculate  the  same  for  the  random  walk  Laplacian. 

To  summarize,  we  use  the  above  operator  definitions  with  q  —  1  and  r  =  1. 
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Therefore,  we  use 


TVw(u)  =  max{(di vw<t>,u)v  :<f>e£,  \\</>\\£t00  <  l}  =  ^  ^2  w(x,y)\u(x)  -  u(y) |.  (7) 

x.y 

In  this  section,  we  use  the  notation  u(x)  to  denote  the  value  of  u  at  node  x  G  V  that  provides 
information  about  the  class  membership  of  the  node.  Specifically,  we  use  u(x)  =  0  to  denote  the 
fact  that  node  x  belongs  to  class  1,  and  u(x)  =  1  to  denote  that  it  belongs  to  class  2. 


2.3  Optimization 

Constrained  and  Unconstrained  Minima 

This  section  will  focus  on  the  connection  between  constrained  and  unconstrained  minima.  Con¬ 
sider  the  problem  of 

min  f(x) 

X 

under  the  constraint 

g(x)  =  0. 

Suppose  that  this  minimum  is  achieved  at  x0.  By  the  theory  of  Lagrange  multipliers,  there  exists  a 
multiplier  A  such  that,  if  G  —  f  +  A g, 


G'(x  0)  =  0,  g(x  0)  =  0, 


G"(x0,h)  =  5]  I 


i,j= 1 


d2G 

dx'dx'J 


)tihj  >  0 


for  all  h  ^  0  such  that 


g'(x0,h )  =  g\x0)  ■  h  =  0. 

A  point  y  is  considered  non-singular  if 


d2G  dg 

dx‘ldx:>  dxl 

99_  o 

dxi  u 


is  7^  0  at  y. 

If  xq  is  a  nonsingular  minimum  point  for  /  subject  to  g  =  0,  we  see  that  G"(x o,  h )  =  0  if  and 
only  if  h  =  0,  and  otherwise  it  is  positive.  Therefore,  there  exists  a  positive  number  c  such  that 


G"(x o,  h)  +  c(g'(x o,  h))2  >  0 
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for  all  h  ^  0.  Let 


F  —  f  +  Xg  +  -eg2  —  G  +  —  c^2. 

Then  we  have  the  following  equations: 

F\x  o)  =  G\x  o)  =  0, 

F"(xo-,  h )  =  G"(i0,  /i)  +  c(c/(£0,  /i))2  >  0,  h  ±  0. 

Therefore,  we  have  the  following  observation: 

Note:  If  x0  is  a  nonsingluar  minimum  of  /  subject  to  the  constraint  g  —  0,  there  exists  a 
multiplier  A  and  a  constant  c  such  that  xq  is  an  unconstrained  local  minimum  of  the  function 

F  =  f  +  Xg+  ^ eg 2. 

We  also  have  a  converse  statement:  if  g(x 0)  =  0  and  x()  is  a  minimum  of  such  F,  then  x0  is  a 
minimum  of  /  subject  to  the  constraint  g  =  0. 

This  idea  is  used  in  the  augmented  Lagrangian  method,  to  be  discussed  shortly. 

Penalty  Methods 

A  popular  method  to  find  a  constrained  minimum  of  a  function  is  the  method  of  penalty  func¬ 
tions.  Suppose  one  seeks  the  minimum  of  function  / 

min  f(x) 

X 

under  the  constraint  g(x)  =  0.  The  penalty  method  is  an  iterative  procedure  that,  at  iteration  n, 
seeks  the  minimum  point  xn  of  the  function 

Fn(x)  =  f(x)  +  ^ ng2(x ). 

The  limit  of  the  sequence  xn,  if  it  exists,  is  the  constrained  minimum  of  function  /.  Note  also  that 

0  =  F'n(xn)  =  f'(xn)  +  ng{xn)g'{xn). 

Considering  the  above,  \n  =  ng(xn )  converges  to  a  Lagrange  multiplier  A  (see  previous  section) 
for  the  constrained  minimum  of  /.  However,  this  method  has  a  disadvantage  as  it  becomes  sensitive 
to  round-off  errors  due  to  the  ng2{x)  term  since  it  is  difficult  to  obtain  an  accurate  representation 
of  xn  for  large  values  of  n.  To  circumvent  this  difficulty,  we  present  the  augmented  Lagrangian 
method. 


13 


Augmented  Lagrangian  Method 

Consider  the  problem  of  finding  the  minimum  point  of  /  subject  to  the  constraint  g(x)  =  0. 
One  version  of  the  augmented  Lagrangian  method  is  an  iterative  procedue  that,  at  interation  n, 
seeks  the  minimum  point  xn  of  the  function 

K(x)  =  f(x )  +  A  ng(x)  +  ^cg2(x). 

The  limit  of  the  sequence  xn,  if  it  exists,  is  the  constrained  minimum  of  the  function  /.  To  derive 
the  update  equation  for  A,  consider  the  following  equation: 

F\xn)  =  f\xn)  +  (An  +  cg(xn))g\xn)  =  0. 

So,  one  can  consider  selecting 

An+1  =  An  +  cg(xn). 

In  our  case,  we  keep  c  constant.  However,  that  does  not  have  to  be  the  case,  and  one  can  certainly 
derive  a  procedure  for  which  c  is  updated  at  each  step. 

Translating  this  problem  to  the  multi-constraint  case,  consider  the  problem  of  finding  the  min¬ 
imum  of  /  subject  to  constraints  gt(x)  =  0  for  i  —  1  :  N.  Similar  theory  can  be  formulated 
using 

F(x,  A)  =  f(x)  +  J2\igi(x)  +  ^c^gf{x) 

i  i 

with  the  A*  being  calculated  in  the  following  manner: 

A“+1  =  A”  +  cg,(xn). 

Therefore,  we  have  the  following  augmented  Lagrangian  method: 

Augmented  Lagrangian  Method:  Consider  the  problem  of  finding  the  minimum  of  /  subject  to 
constraints  gi(x)  =  0  for  i  —  1  :  N.  Let 

F(x,X)  =  f(x)  +  ^  A ^(x)  +  ^c^g^(x). 

i  i 

Starting  with  an  initial  guess  Aj1  for  i  —  1  :  N,  having  obtained  A”,  let  xn  be  the  minimum  of 
Fn  =  F(x,  An).  We  update  A  by 

Ar^A^  +  c^O. 
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As  n  increases,  xn  converges  the  actual  minima  of  /  subject  to  the  constraints. 

If  the  problem  is  to  find  the  maximum  of  /  instead,  we  update  A  by  the  following: 

A”+1  =  K  -  cg,(x„). 

This  can  be  easily  derived  from  the  previously  stated  theory,  since  we  now  use  the  function 

F(x,X)  =  f(x)  +  J2XMx)  ~  ^c^g2(x). 

i  i 

The  derivation  is  very  similar  to  the  one  already  shown.  Suppose  that  this  maximum  is  achieved  at 
xq.  By  the  theory  of  Lagrange  multipliers,  there  exists  a  multiplier  A  such  that,  if  G  —  f  +  A g, 

G\x  o)  =  0,  g(x  o)  =  0, 

«W)  =  £(^)^<o 

i,j= 1 

for  all  h  ^  0  such  that 

g'(x0,h )  =  g\x0)  -li  =  0. 

There  also  exists  a  positive  number  c  such  that 

G"(x o,  h)  +  c(g'(x o,  h))2  <  0. 

This  justifies  the  use  of  F. 

The  advantage  of  this  method  is  that  unlike  the  penalty  method,  the  function  Fn  doesn’t  contain 
a  large  constant  in  front  of  the  g2  term,  in  which  case  it  would  be  difficult  to  obtain  a  good  numerical 
approximation  of  the  minimum  for  large  n.  Instead,  the  two  terms  allow  the  multipliers  to  stay  a 
relatively  small  size.  However,  a  disadvantage  of  the  augmented  Lagrangian  method  is  that  one 
needs  to  have  an  initial  estimate  of  the  multipliers,  which  is  not  the  case  with  the  penalty  methods. 


Convex  Optimization  and  Lagrange  Duality 

Consider  the  following  problem: 


min  f(x) 

X 


subject  to 


hi(x)  <  0,  gi(x)  =  0 


for  i  —  1  :  N.  This  is  the  primal  problem.  Let  p*  denote  the  optimal  value  given  by  x*: 

p*  =  min  f(x). 

X 


(8) 
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Now  let 


L(x,  A,  n)  =  f(x)  +  ^2  A ihi(x )  +  22  Xi  >  0 

i  i 

and 

k(X,  p)  —  min  L(x,  X,  p),  A>0  ,  p. 

X 

We  have  the  following  inequality: 

k(X,fi)  <p*. 

This  is  because  k( A,  p)  <  L(x,  A,  /i)  <  /(x)  for  all  x  such  that  satisfy  the  constraints  (8). 

Now  we  introduce  the  dual  problem 

d*  =  max  k( A,  u), 

so  d*  is  the  optimal  value. 

Since  /c(A,  //)  <  p*,  we  have 

d*  <p*. 

This  is  called  the  notion  of  weakduality  and  it  always  holds.  The  notion  of  strong  duality  occurs 
when  d*  =  p*.  Strong  duality  does  not  always  occur,  but  it  is  true  in  the  case  when  convexity  is 
mixed  with  certain  conditions,  such  as  Slater  condition  or  the  KKT  condition. 

Now  suppose  we  don’t  include  the  equality  constraints.  We  can  write  the  primal  problem  as 

p*  =  minmaxLfx,  A), 

x  A>0 

since  maximizing  L(x,  A)  in  the  variable  A  gives  the  original  function  in  this  case.  By  definition, 
the  dual  problem  is 

d*  =  max  min  L(x,  A). 

A>0  x 

So  the  principle  of  weak  duality  states  that 

maxminLfx,  A)  <  minmaxLfx,  A). 

A>0  x  x  A>0 

We  thus  have  some  form  of  the  max-min  inequality.  The  principle  of  strong  duality  states  that 

max  min  L(x,  X)  =  min  max  L(x,  A). 

A>0  x  x  A>0 

In  further  sections,  we  include  algorithms  that  deal  directly  with  the  primal  problem,  but  also 
some  that  solve  the  dual  problem. 
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2.4  Clustering 


Considering  the  problem  of  finding  the  minimum  cut  (1)  as  way  to  cluster  a  set  into  two  parts. 
As  mentioned  earlier,  in  order  to  avoid  an  undesired  classification  of  an  isolated  vertex,  one  often 
needs  to  use  some  normalization.  In  particular,  we  mention  the  ratio  cut  and  the  normalized  cut. 


Ratio  Cut 


One  way  to  modify  the  problem  is  to  find  a  subset  S  of  V  such  that 

RatioCut(S,  S)  =  cut(S,  £)  +  t4t) 

is  minimized.  This  is  a  NP  hard  discrete  problem  [105].  One  way  to  simplify  it  would  be  to  allow 
the  solution  to  take  arbitrary  values  in  M.  This  leads  to  the  following  relaxed  RatioCut  problem: 


min (w,  Lit),  u  _L  1,  I  Ml2  =  n. 

ueRn 

The  fact  that  the  above  problem  obtains  a  real- valued  solution  instead  of  a  discrete-valued  solution 
is  emphasized.  To  solve  the  above  problem,  one  can  apply  the  Raleigh-Ritz  theorem,  and  the 
solution  is  given  by  the  second  eigenvector  of  the  Laplacian.  To  obtain  a  binary  solution,  one  can 
use  several  methods,  the  simplest  of  which  is  thresholding. 

Normalized  Cut 

If  we  let  vol(S)  =  J2xeSd(x)’  where  d(x)  represents  the  degree  of  vertex  x,  another  way  to 
modify  the  problem  is  to  find  a  subset  S  of  V  such  that 

Ncut(S,  S)  =  cut(S,  S)  (  *  .  +  1  ~  ) 

\vol{b)  vol(b)/ 

is  minimized.  This  is  yet  again  a  NP  hard  discrete  problem  [105].  We  simplify  it  by  allowing  the 
solution  to  take  arbitrary  values  in  M.  This  leads  to  the  following  relaxed  Ncut  problem: 


min(u.Ln),  Du  _L  1,  (u.Du)  =  vol(Y). 

u£Rn  '  ' 

To  solve  the  above  problem,  one  can  apply  the  Raleigh-Ritz  theorem,  and  the  solution  is  given  by 
the  second  eigenvector  of  the  random  walk  Laplacian.  Thresholding  can  be  used  to  obtain  a  binary 
solution. 
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Other  Normalizations 


Two  other  possible  normalizations  are 


N 1 


cut(S,  S ) 

min(vol(S),vol(S))  ’ 
)  cut(S ,  S ) 

minds'),  \S\) 


Spectral  Clustering 

We  have  seen  that  the  ratio  cut  and  the  normalized  cut  problems  can  be  formulated  in  a  con¬ 
tinuous  setting,  with  solutions  given  by  the  second  eigenvectors  of  the  Laplacian  (L  =  D  —  W) 
and  the  random  walk  Laplacian  (Lrw  =  D  1 L),  respectively.  For  binary  problems,  one  can  sim¬ 
ply  find  the  solution  by  thresholding  the  second  eigenvector  of  an  appropriate  Laplacian.  Now, 
suppose  there  are  k  clusters.  The  method  of  spectral  clustering  computes  the  k  clusters  in  the 
following  manner: 

1.  Formulate  the  data  set  in  a  graph  setting. 

2.  Compute  either  the  Laplacian  (L  =  D  —  W)  and  the  random  walk  Laplacian  (Lrw  =  D  1 L) 

3.  Compute  the  first  k  eigenvectors  (vi,v2,...,  Vk)  of  the  Laplacian  (or  the  random  walk  Lapla¬ 
cian) 

4.  Let  U  be  the  matrix  containing  the  vectors  Vi,  v2,...,  iy.  as  columns. 

5.  Cluster  the  rows  of  the  matrix  U  with  the  /r- means  algorithm  into  k  clusters. 

The  A- means  algorithm  for  k  clusters  proceeds  iteratively  by  first  choosing  k  means  and  then 
assigning  each  point  to  the  cluster  whose  mean  is  the  “closest”  to  the  point.  The  mean  of  each 
cluster  is  then  recalculated.  The  iterations  continue  until  there  is  little  change  from  one  iteration  to 
the  next. 
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CHAPTER  3 


MBO  Method 


3.1  Derivation  of  the  Method 


We  consider  the  general  problem  for  data  classification  and  image  processing: 


min  { E(u )  =  R{u )  +  F(u)\, 

U  L  J 


where  R(u )  is  the  regularization  functional  and  F  (a)  is  the  fidelity  term.  Here  u  is  a  function 
defined  on  the  space  of  the  data  set  and  describes  the  appropriate  characteristic.  For  example,  in  the 
case  of  the  problem  of  classification,  u  indicates  the  class  number.  In  the  case  of  image  inpainting, 
u  indicates  intensity  value  of  a  pixel.  Some  common  R(u )  regularization  term  examples  are 

*(“)  =  /  *“>  =  /  |v“|2<fa' 

Some  common  F(u)  fidelity  term  examples  are 

F{u)  =  [  X(x)\u  —  uo\dx,  F(u)  =  [  X(x)\u  —  uo\2dx. 


The  A  term  regulates  the  fidelity  region.  It  equal  to  0  on  the  non-fidelity  region  and  to  a  constant  on 
the  fidelity  region.  For  example,  in  the  case  of  image  inpainting,  the  fidelity  region  is  everything 
but  the  damaged  region.  In  the  case  of  classification,  the  fidelity  would  be  all  the  points  with  known 
classification  values.  The  resulting  functional  is  a  tradeoff  between  accuracy  in  the  classification 
of  given  labels  and  function  smoothness.  It  is  also  desirable  to  choose  R  to  preserve  the  sharp 
discontinuities  that  may  arise  in  the  boundaries  between  classes. 

To  formulate  the  MBO  method  for  classification  and  image  processing,  we  introduce  another 
example  of  the  regularization  term  R  (  u  ')  =  Ginzburg-Landau  functional: 

GLe(u)  —  -  j  |Vu|2<irH — J  W(u)dx,  (9) 

where  W(u )  =  ( u 2  —  l)2.  We  use  this  regularization  term  for  the  MBO  method.  We  also  use  the 
L2  fidelity  term. 
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Ginzburg-Landau  Functional 

The  classical  Ginzburg-Landau  (GL)  functional  was  originally  proposed  to  describe  physical 
phenomena  such  as  liquid-gas  transitions  and  superconductivity  [97].  Here,  u  is  a  scalar  field 
defined  over  a  space  of  arbitrary  dimensionality  and  representing  the  state  of  the  phases  in  the 
system,  V  denotes  the  spatial  gradient  operator,  W (u)  is  a  double-well  potential,  such  as  W (u)  = 
\{u2  —  l)2  and  e  is  a  positive  constant. 

The  functional  (9)  is  composed  of  two  terms:  a  smoothing  term  that  measures  the  differences 
in  the  components  of  the  field,  and  a  potential  term  that  measures  how  far  each  component  is 
from  a  specific  value  (±1  in  the  example  above).  Consequently,  the  minimization  of  the  first 
term  leads  to  smoother  regions,  while  the  minimization  of  the  second  penalizes  variations  from 
the  minima  of  the  double- well  potential.  Given  initial  conditions  with  states  +1  and  states  —  1 
distributed  randomly  in  the  domain,  the  minimization  of  the  GL  functional  entails  an  inherent 
conflict  between  the  two  terms  in  the  functional,  leading  to  the  generation  of  a  transition  region: 
the  diffuse  interface.  The  smoothness  of  this  diffuse  interface  is  regulated  by  the  parameter  e. 
For  small  e,  the  state  minimizing  the  functional  contains  sharp  transitions  between  the  minima 
of  the  double-well  potential,  while  a  large  e  gives  more  weight  to  the  smoothing  term  so  that  the 
transitions  are  more  gradual. 

It  is  shown  in  [70]  that  the  e  — >  0  limit  of  the  GL  functional,  in  the  sense  of  T-convergence,  is 
the  Total  Variation  (TV)  semi-norm,  so  one  can  write: 

GLe(u)  — »r  IMItv, 

where 

\Wu\dx. 

Therefore,  the  justification  for  using  the  justification  for  using  the  Ginzburg-Landau  functional 
comes  from  the  fact  that  it  is  related  to  the  total  variation  seminorm,  which  has  been  used  suc¬ 
cessfully  in  many  image  processing  applications.  It  has  also  been  applied  to  numerical  analysis  of 
differential  equations  [60].  The  graph  version  of  this  result  was  shown  in  [103]. 

This  convergence  allows  the  two  functionals  to  be  interchanged  in  some  cases.  One  might 
prefer  to  use  the  GL  functional  instead  of  the  TV  semi-norm  since  its  highest  order  term  is  purely 
quadratic  which  allows  for  efficient  minimization  procedures.  It  also  allows  us  to  use  spectral 
mehtods  for  minimization.  In  contrast,  minimization  of  the  TV  semi-norm  leads  to  a  nonlinear 
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curvature  term,  making  it  less  trivial  to  solve  numerically.  However,  recent  advances,  such  as  the 
split  Bregman  method  described  in  [55],  have  made  progress  in  such  problems. 

An  analogous  convergence  property  has  recently  been  shown  in  the  case  of  graphs  as  well,  for 
binary  segmentations  [103].  Since  TV  is  an  /.'-based  metric,  TV-minimization  leads  to  sparse 
solutions,  namely  indicator  functions  that  closely  resemble  the  discrete  solution  of  the  original 
NP-hard  combinatorial  segmentation  problem  [100].  Thus,  the  GL  functional  actually  becomes 
an  L 1  metric  in  the  small  e  limit,  and  leads  to  sharp  transitions  between  classes.  Intuitively,  the 
convergence  of  GL  to  TV  holds  because  in  the  limit  of  a  vanishing  interface,  the  potential  takes 
precedence  and  the  graph  nodes  are  forced  towards  the  minima  of  the  potential,  achieving  a  con¬ 
figuration  of  minimal  length  of  transition.  This  is  contrast  to  more  traditional  spectral  clustering 
approaches,  which  can  be  understood  as  //“-based  methods  and  do  not  favor  sparse  solutions.  Fur¬ 
thermore,  while  the  smoothness  of  the  transition  in  the  GL  functional  is  regulated  by  e,  in  practice 
the  value  of  e  does  not  have  to  be  decreased  all  the  way  to  zero  to  obtain  sharp  transitions.  This 
capability  of  modeling  the  separation  of  a  domain  into  regions  or  phases  with  a  controlled  smooth¬ 
ness  transition  between  them  makes  the  diffuse  interface  description  attractive  for  segmentation 
problems,  and  distinguishes  it  from  more  traditional  graph-based  spectral  partitioning  methods. 

The  diffuse  interface  description,  with  its  controlled  smoothness  transition  between  phases,  is 
attractive  for  segmentation  problems  due  to  its  straightforward  way  of  modeling  the  separation  of  a 
domain  into  regions  or  phases.  It  has  been  used  successfully  in  image  impainting  [8,32]  and  image 
segmentation  [39]. 


Derivation 


By  using  the  Ginzburg-Landau  functional  for  the  regularization  term  and  the  L2  fidelity  term, 
we  obtain  the  following  minimization  problem: 


mm 

U 


in  \^E{u)  —  -  /  \X7u\2dx-\ —  /  W(u)dx+  /  \(x)\u  —  u$\2dx}. 


(10) 


The  energy  can  be  minimized  in  the  L2  sense  using  gradient  descent.  This  leads  to  the  following 

dynamic  equation  ( modified  Allen-Cahn  equation ): 

du 
dt  : 


5GLt  5F  1..  .  5F 

- - q—  =  eAu  -  -W  (u)  -  q—, 

du  du  e  du 


(ID 


where  A  represents  the  Laplacian  operator.  A  local  minimizer  of  the  energy  is  obtained  by  evolving 
this  expression  to  steady  state.  Note  that  E  is  not  convex,  and  may  have  multiple  local  minima. 
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In  their  work  [9],  Bertozzi  and  Flenner  propose  a  segmentation  algorithm  for  solving  (5.1) 
in  a  graph  setting.  The  functional  is  minimized  using  the  method  of  gradient  descent  and  convex 
splitting.  The  main  purpose  of  thie  MBO  method  is  to  develop  a  more  efficient  and  simple  method 
for  minimizing  (5.1)  in  the  small  e  limit.  An  answer  comes  from  the  relation  between  the  Allen- 
Cahn  equation  and  the  motion  by  mean  curvature. 

Let  us  start  by  reviewing  this  connection  in  the  continuous  setting.  In  [81],  Merriman,  Bence 
and  Osher  propose  an  algorithm  to  approximate  motion  by  mean  curvature,  or  motion  in  which 
normal  velocity  equals  mean  curvature,  using  threshold  dynamics.  The  authors  note  that  if  one 
applies  the  heat  equation  to  an  interface,  then  the  diffusion  blunts  the  sharp  points  of  the  boundary, 
but  has  very  little  effect  on  the  flatter  regions.  Therefore,  one  can  imagine  that  diffusion  creates 
some  sort  of  motion  by  mean  curvature,  providing  that  we  specify  the  boundaries  of  the  moving 
set. 

Given  a  phase  field  u(x,  t),  consider  the  basic  (unmodified)  Allen-Cahn  equation,  namely  equa¬ 
tion  (11)  without  the  fidelity  term: 

du  1 

^-  =  eAu--W'(u).  (12) 

dt  e  v  ' 

For  small  values  of  e,  the  following  time-splitting  scheme  can  be  used  numerically  to  evolve  the 
Allen-Cahn  equation: 


1.  The  first  step  is  propagation  using: 


2.  The  second  step  is  propagation  using: 


du 

Tt  = fAu' 


du  l  , 

W  =  -W(u). 

Note,  however,  that  in  the  e  — >•  0  limit,  the  second  step  is  simply  thresholding  [81].  Thus,  as  e  — »  0, 
the  time  splitting  scheme  above  consists  of  alternating  between  diffusion  and  thresholding  steps. 

It  has  been  shown  [92]  that  in  the  limit  e  — >•  0,  the  rescaled  solutions  ue(z,t/e)  of  (12)  yield 
motion  by  mean  curvature  of  the  interface  between  the  two  phases  of  the  solutions.  This  motivates 
the  two  sequential  steps  of  the  MBO  scheme: 

1.  Diffusion.  Let  un+ i  =  S(St)un  where  S(dt)  is  the  propagator  (by  time  St)  of  the  standard 
heat  equation: 

du  . 

m  =  A“' 
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2.  Thresholding.  Let 

{1  if  un+  2  >  0, 

—  1  if  un+ ^  <  0. 

Barles  [5]  and  Evans  [40]  have  proven  rigorously  that  this  scheme  approximates  motion  by  mean 
curvature. 

Multiple  extensions,  adaptations  and  applications  of  the  MBO  scheme  are  present  in  literature. 
We  find  the  modification  of  the  MBO  scheme  for  solving  the  inhomogeneous  Allen-Cahn  equation 
proposed  in  [39]  particularly  interesting.  To  create  a  fast  image  segmentation  algorithm,  Esedoglu 
and  Tsai  propose  a  thresholding  scheme  for  minimizing  a  diffuse  interface  version  of  the  piecewise 
constant  Mumford-Shah  functional 

MSe(u,  ci,c2)  =  [  e\Vu\2  +  -W(u)  +  A{w2(ci  -  f)2  +  (1  -  m)2(c2  -  f)2}dx,  (13) 

J  D  e 

where  /  is  the  image.  The  first  variation  of  the  model  (13)  yields  the  following  gradient  descent 
equation: 

ut  =  2eA u  -  ^-W'(u)  +  2A{w(ci  -  f)2  +  (1  -  u)(c2  -  f)2} 

and  the  adaptation  of  the  MBO  scheme  is  used  to  solve  it.  Esedoglu  and  Tsai  propose  the  following 
scheme  (similar  to  the  MBO  scheme  where  the  propagation  step  based  on  the  heat  equation  is 
combined  with  thresholding): 

*  Step  1  Let  v(x )  —  S(8t)un{x)  where  S(St^)  is  3  prop3.g3.tor  by  time  St  of  the  ec[U3tion! 

wt  =  Aw-  2A  (w(ci  -  f)2  +  (1  -  w)(c2  -  / )2) 


with  appropriate  boundary  conditions. 


*  Step  2  Set 


U-n+l  (*^) 


0  if  v(x)  G  (—00,  \], 
1  if  v(x)  E  (|,  00). 


Some  other  extensions  of  the  MBO  scheme  appeared  in  [37, 38, 83].  An  efficient  algorithm  for 
motion  by  mean  curvature  using  adaptive  grids  was  proposed  in  [93]. 

The  motion  by  mean  curvature  of  the  MBO  scheme  can  be  generalized  to  the  case  of  functions 
on  a  graph  in  much  the  same  way  as  the  procedure  followed  for  the  modified  Allen-Cahn  equation 
(11).  We  now  use  the  same  ideas  and  apply  a  two-step  time  splitting  scheme  to  (11)  so  that  the 
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second  step  is  the  same  as  the  one  in  the  original  MBO  scheme.  The  idea  is  then  to  replace  all 
the  operators  with  a  more  general  graph  term,  since  we  are  considering  the  graphical  framework. 
The  only  operator  to  deal  with  here,  is  the  A  operator,  and  we  can  replace  it  by  several  different 
versions  of  the  graph  Laplacian.  In  the  graphical  framework,  we  have  the  following  three  versions 
that  are  related  to  the  differential  A  operator: 

*  L  =  D  —  W,  unnormalized  Laplacian 

*  Ls  =  D  ^LD symmetric  Laplacian 

*  Lrw  =  D  XL,  random  walk  Laplacian 

Since  L.s  is  a  symmetric  matrix,  we  use  the  symmetric  Laplacian,  and  thus  replace  A u  by 
— L su.  This  connection  is  to  be  explained  in  Section  10. 

The  discretized  version  of  the  algorithm  is: 


Binary  MBO  Algorithm: 


Initialize  u.  Until  convergence,  alternate  between  the  following  two  steps: 


1.  Heat  equation  with  forcing  term: 

un+\  —  u11 
dt 


— L  sun  —  n(un  —  u). 


2.  Thresholding: 


jl,  if  >  0, 

[-1,  if«;+i<o. 


(14) 


Here,  after  the  second  step,  u™  can  take  only  two  values  of  1  or  —1;  thus,  this  method  is  appro¬ 
priate  for  binary  segmentation.  Note  that  the  fidelity  term  scaling  can  be  different  from  the  one  in 
(11). 


The  first  part  of  the  two  step  scheme  is  solved  using  the  spectral  decomposition  of  the  symmetric 
graph  Laplacian.  Let  un  =  Yhk^^kix)  and  Ci\(un  —  u0 )  =  Ylk(^k(t)k{x),  where  4>(x)  are  the 
eigenfunctions  of  the  symmetric  Laplacian.  Using  the  obtained  representations  and  equation  (30), 
we  obtain 


a 


71+ 1 

k 


-  dtd 
1  +  dt\k 
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where  Xk  are  the  eigenvalues  of  the  symmetric  graph  Laplacian. 

In  practice,  it  can  be  productive  to  repeat  the  diffusion  step  a  number  of  times  before  thresh¬ 
olding.  In  order  to  keep  the  convention  that  one  iteration  of  the  diffusion-thresholding  procedure 
corresponds  to  one  time  step,  we  divide  dt  by  the  number  of  diffusion  steps  per  iteration,  which 
we  denote  Ns. 

To  compute  the  eigenvalues  and  eigenvectors  of  the  graph  Laplacian,  we  use  two  methods.  One 
of  them  is  the  Raleigh-Chebyshev  procedure  of  [1]  and  the  second  one  is  the  Nystrom  extension 
[9,44,45].  See  the  Appendix  for  a  brief  description  of  the  latter  method. 

3.2  MBO  Method  Procedure 

Here  are  the  steps  of  the  MBO  method  in  finding  a  binary  minimizer  u. 

*  Create  a  graph  from  the  data,  choose  a  weight  function  and  then  calculate  the  symmetric 
graph  Laplacian. 

*  Calculate  the  eigenvectors  and  eigenvalues  of  the  symmetric  graph  Laplacian. 

*  Initialize  u. 

*  Find  or  set  the  fidelity  region. 

*  Apply  the  two-step  scheme  described  earlier  until  a  stopping  criterion  is  satisfied. 

The  final  u  will  be  binary.  Changes  for  multiclass  case  will  be  discussed  in  the  next  section. 

3.3  Extension  to  the  Multiclass  Case 

Given  ND  data  points,  we  generalize  the  label  vector  u  to  a  label  matrix  U  =  (ux, . . . ,  u  v,  j7  - 
Rather  than  node  i  adopting  a  single  state  u,  e  M,  it  now  adopts  a  composition  of  states  expressed 
by  a  vector  u,  e  MA  where  the  A  th  component  of  u,  is  the  strength  with  which  it  takes  on  class  k. 
The  matrix  U  has  dimensions  No  x  K,  where  K  is  the  total  number  of  possible  classes. 

For  each  node  i,  we  require  the  vector  u,  to  be  an  element  of  the  Gibbs  simplex  SA ,  defined  as 

K 

y]  xk  =  i 

fc=i 

Vertex  k  of  the  simplex  is  given  by  the  unit  vector  ek,  whose  A  th  component  equals  1  and  all  other 
components  vanish.  These  vertices  correspond  to  pure  phases,  where  the  node  belongs  exclusively 


£A  :=  (aq, . . .  ,xK)  e  [0, 1] 


K 
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to  class  k.  Note  that  the  simplex  formulation  has  a  straightforward  probabilistic  interpretation,  with 
Uj  representing  the  probability  distribution  over  the  K  classes.  In  other  segmentation  algorithms, 
such  as  spectral  clustering,  these  real-valued  variables  can  have  different  interpretations  that  are 
exploited  for  specific  applications,  as  discussed  in  [59,74]. 


3.3.1  Version  1:  The  MBO  Method  Extension  (Multiclass  MBO) 


Using  the  standard  Gibbs-simplex  SA  just  defined,  the  multiclass  extension  of  the  algorithm 
in  [80]  is  straightforward.  The  notation  is  the  same  as  in  the  previous  section:  we  use  a  matrix  U  to 
represent  the  phase  composition  of  nodes.  The  second  step  of  the  algorithm  is  modified,  however, 
so  that  the  thresholding  is  converted  to  the  displacement  of  the  vector  field  variable  towards  the 
closest  vertex  in  the  Gibbs  simplex.  In  other  words,  now  in  the  second  step  the  row  vector  u;;n+2 
of  step  1  is  projected  back  to  the  simplex  (using  the  approach  outlined  in  [23]  as  before)  and  then  a 
pure  phase  given  by  the  vertex  in  the  SA  simplex  closest  to  Ujn+5  is  assigned  to  be  the  new  phase 
composition  of  node  i. 

In  summary,  the  new  algorithm  consists  of  alternating  between  the  following  two  steps  to  obtain 
approximate  solutions  U”  at  discrete  times: 


1.  Heat  equation  with  forcing  term: 

jjn+l  _  Xjn 

dt 


— LsUn  -  fi( Un  -  U). 


2.  Thresholding: 


Uj 


n+1 


where  vertex  is  the  vertex  in  the  simplex  closest  to  projectToSimplex(ui 


(15) 


As  for  the  multiclass  GL  algorithm,  when  a  label  is  known,  it  is  represented  by  the  corresponding 
vertex  in  the  SA  simplex.  The  final  classification  is  achieved  by  assigning  node  i  to  class  k  if  if 
the  kth  component  of  u,  is  one.  Again,  as  in  the  binary  case,  the  diffusion  step  can  be  repeated 
a  number  of  times  before  thresholding  and  when  that  happens,  dt  is  divided  by  the  number  of 
diffusion  iterations  Ng. 

As  in  the  previous  section,  we  use  an  implicit  numerical  scheme.  For  the  MBO  algorithm,  the 
procedure  involves  modifying  (19)  to  apply  Ls  to  U"+2  instead  of  to  Un.  This  gives  the  diffusion 
step 


U'i+i 


B 


U* 


dt  n{Un 
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where 


B  =  I  +  dt  Ls. 

As  before,  we  use  the  eigendecomposition  Ls  =  XAXr  to  write 

B  =  X(I  +  dt  A)  XT, 

which  we  approximate  using  the  first  Ne  eigenfunctions. 

For  initialization,  the  phase  compositions  of  the  fidelity  points  are  set  to  the  vertices  of  the 
simplex  corresponding  to  the  known  labels,  while  the  phase  compositions  of  the  rest  of  the  points 
are  set  randomly. 

The  energy  minimization  proceeds  until  a  steady  state  condition  is  reached.  Once  the  change  of 
the  norm  of  the  vector  field  in  subsequent  iterations  falls  below  a  threshold,  the  system  is  no  longer 
evolving  and  the  energy  decrement  is  negligible.  Consequently,  the  calculation  is  stopped  when 

maxj  ||ujn+1  —  Ujn||2 

- II  n+iii2 -  <  V’ 

max*  ||Ujn+i[|z 

where  77  represents  a  given  small  positive  constant.  The  final  classes  are  obtained  by  assigning 
class  k  to  node  i  if  u,  is  closest  to  vertex  ek  on  the  Gibbs  simplex. 

The  multiclass  MBO  algorithm  is  summarized  in  Figure  1.  Its  complexity  is  0(NDKNeNg ) 
operations  for  the  main  loop,  0(Nr>K  log  K)  operations  for  the  projection  to  the  simplex  and 
()(NdK)  operations  for  thresholding.  As  for  the  multiclass  GL  algorithm,  Ne  <A  No  and  K  <C 
No-  Furthermore  Ns  needs  to  be  set  to  three,  and  due  to  the  thresholding  step,  we  find  that 
extremely  few  iterations  (e.g.,  6)  are  needed  to  reach  steady  state.  Thus,  in  practice,  the  complexity 
of  this  algorithm  is  linear  as  well,  and  typical  runtimes  are  very  rapid  as  shown  in  Table  III. 

Note  that  graph  analogues  of  continuum  operators,  such  as  gradient  and  Laplacian,  can  be  con¬ 
structed  using  tools  of  nonlocal  discrete  calculus.  Hence,  it  is  possible  to  express  notions  of  graph 
curvature  for  arbitrary  graphs,  even  with  no  geometric  embedding,  but  this  is  not  straightforward. 
For  a  more  detailed  discussion  about  the  MBO  scheme  and  motion  by  mean  curvature  on  graphs, 
we  refer  the  reader  to  [104]. 

3.3.2  Version  2:  A  Ginzburg-Landau  Multiclass  Extension  (Multiclass  GL) 

Here  we  provide  a  different  version  of  the  multiclass  extension  that  does  not  use  the  MBO 
scheme. 
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Figure  1:  Multiclass  MBO  Algorithm 


Require:  dt,  ND,  Ne,  Ns,  K,  /r,  U,  A,  X 

Ensure:  out  =  Uend 

Y(-(i+|a)_1xt 

for  i  =  1  — »  ND  do 

Uj®  <r-  rand(( 0, 1)),  U;°  «-  projectToSimplex(ui°).  If  pt  >  0,  U^.  •<— 

end  for 

n  <T-  1 

while  Stop  criterion  not  satisfied  do 
for  s  =  1  — >  Ns  do 

Z  -s—  Y  |"lJ”  —  jN  /tt(Un  -  U) 

U«+t  xz 

end  for 

for  i  =  1  — y  Nd  do 

u;”+1  projectToSimplex(uin+1) 

Uin+1  <—  efc,  where  k  is  closest  simplex  vertex  to  U;n+1 

end  for 

n  <—  n  +  1 

end  while 


and  fi,  is  a  vector  indicating  prior  class  knowledge  of  sample  i.  We  set  ft,  =  ep.  if  node  i  is  known 
to  be  in  class  k. 

As  mentioned  before,  the  first  (smoothing)  term  in  the  GL  functional  (16)  measures  variations 
in  the  vector  field.  The  simplex  representation  has  the  advantage  that,  like  in  Potts-based  models 
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but  unlike  in  some  other  multiclass  methods,  the  penalty  assigned  to  differently  labeled  neighbors 
is  independent  of  the  integer  ordering  of  the  labels.  The  second  (potential)  term  drives  the  system 
closer  to  the  vertices  of  the  simplex,  with  the  use  of  an  Lx  norm  preventing  the  emergence  of  an 
undesirable  minimum  at  the  center  of  the  simplex,  as  would  happen  with  an  L  >  norm  for  large  K . 
This  potential  aims  to  provide  a  clear  way  to  calculate  class  memberships,  as  the  phase  composition 
is  purer  near  the  vertices  of  the  simplex.  The  compromise  between  the  smoothing  and  potential 
terms  is  established  through  the  constant  e.  The  third  (fidelity)  term  enables  the  encoding  of  a 
priori  information. 

Note  that  one  can  obtain  meaningful  results  without  fidelity  information  (unsupervised),  but 
the  methods  for  doing  so  are  not  as  straightforward.  One  example  is  a  new  TV-based  modularity 
optimization  method  [66]  that  makes  no  assumption  as  to  the  number  of  classes  and  can  be  recast 
as  GL  minimization.  Also,  while  T -convergence  to  TV  in  the  graph  setting  has  been  proven  for  the 
binary  segmentation  problem  [103],  no  similar  convergence  property  has  yet  been  proven  for  the 
multiclass  case.  We  leave  this  as  an  open  conjecture. 

Following  [9],  we  use  a  convex  splitting  scheme  to  minimize  the  GL  functional  in  the  phase 
field  approach.  The  energy  functional  (16)  is  decomposed  into  convex  and  concave  parts: 


E(  U)  -^convex  (u)  +  -^concave  (u), 

£convex(U)  =  |  (U,  LSU)  +  j  (U,  U) , 

l  K  l 

-^concave  (U)  =  77-3  ^  '  TT  ^  llUi  ~  efc||_Li 

i£V  k= 1 

+  Ef  ll«.-a.llL-f<u,u) 

iev 

with  Cel  denoting  a  constant  that  is  chosen  to  guarantee  the  convexity /concavity  of  the  energy 
terms.  Evaluating  the  second  derivative  of  the  partitions,  and  simplifying  terms,  yields: 


C  >p  +  -.  (17) 

e 

The  convex  splitting  scheme  results  in  an  unconditionally  stable  time-discretization  scheme 
using  a  gradient  descent  implicit  in  the  convex  partition  and  explicit  in  the  concave  partition,  as 
given  by  the  form  [39,41, 110] 


U?k+1  +  dt 


SEr, 


ik 


(U-P)  =  u;i  -  dt 


SE, 


concave 


SUi 


(U, 


ik 


ik) 


29 


We  write  this  equation  in  matrix  form  as 


Un+1  +  dt  (eLsUn+1  +  CU"+1) 


=  U n-dt  T"  +  n(Vn  -  U)  -  CUnJ  , 

where 

K  1  K  1 

Tik  —  z  (1  —  2 dki)  ||uj  —  e/||Ll  —  II Ui  —  em||Li , 

1=1  m= 1 

m^l 

li  is  a  diagonal  matrix  with  elements  /i,:,  and  U  =  (ui, . . . ,  uNd)t. 

Solving  (18)  for  Un+1  gives  the  iteration  equation 


Un+1  =  B  1 


(1  +  C  dt)  Un  -  ^  Tn  -  dt  XJn  -  U) 

Zj  C 


where 


B  =  (1  +  C  dt)  I  +  edtLs 


This  implicit  scheme  allows  the  evolution  of  U  to  be  numerically  stable  regardless  of  the  time 
step  dt,  in  spite  of  the  numerical  “stiffness”  of  the  underlying  differential  equations  which  could 
otherwise  force  dt  to  be  impractically  small. 

In  general,  after  the  update,  the  phase  field  is  no  longer  on  the  SA  simplex.  Consequently,  we 
use  the  procedure  in  [23]  to  project  back  to  the  simplex. 

Computationally,  the  scheme’s  numerical  efficiency  is  increased  by  using  a  low-dimensional 
subspace  spanned  by  only  a  small  number  of  eigenfunctions.  Let  X  be  the  matrix  of  eigenvec¬ 
tors  of  Ls  and  A  be  the  diagonal  matrix  of  corresponding  eigenvalues.  We  now  write  Ls  as  its 
eigendecomposition  Ls  =  XAXT,  and  set 


B  =  X[(l  +  Cdt)I  +  edt  A]  XT, 


but  we  approximate  X  by  a  truncated  matrix  retaining  only  Ne  eigenvectors  (Ne  ND),  to  form 
a  matrix  of  dimension  ND  x  Ne.  The  term  in  brackets  is  simply  a  diagonal  Ne  x  Ne  matrix. 
This  allows  B  to  be  calculated  rapidly,  but  more  importantly  it  allows  the  update  step  (3.3.2)  to 
be  decomposed  into  two  significantly  faster  matrix  multiplications  (as  discussed  below),  while 
sacrificing  little  accuracy  in  practice. 

For  initialization,  the  phase  compositions  of  the  fidelity  points  are  set  to  the  vertices  of  the 
simplex  corresponding  to  the  known  labels,  while  the  phase  compositions  of  the  rest  of  the  points 
are  set  randomly. 
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Figure  2:  Multiclass  GL  Algorithm 


Require:  e,dt,Nu,Ne,K,fj,,\J,A,X. 

Ensure:  out  =  Uend 

C  d  +  7 

Y  •<—  [(1  +  C  di)I  +  e  diA] _1  XT 


for  *  =  1  — >  Nd  do 

Uik  <r-  rand((0 , 1)),  U°k  projectTo Simple x(u,°).  If  /r*  >  0,  17,°  «-  Uik 

end  for 

n  1 

while  Stop  criterion  not  satisfied  do 
for  *  =  1  — >  Nd,  k  =  1  — >  K  do 

Tik  ‘S—  1  5  (1  “  ||u,;n  —  e;||Ll  rim=l,m^t  I  llU«"  “  em\\lJ1 


end  for 
Z^Y 


(1  +  C  dt )  U" 


Un+1  <-  XZ 


§  T"-df^(Un-U) 


for  I  =  1  — ►  ATd  do 

Ui”+1  i—  projectToSimplex(uin+1) 

end  for 

n  <—  n  +  1 

end  while 


The  energy  minimization  proceeds  until  a  steady  state  condition  is  reached.  The  final  classes 
are  obtained  by  assigning  class  k  to  node  i  if  u*  is  closest  to  vertex  e k  on  the  Gibbs  simplex. 
Consequently,  the  calculation  is  stopped  when 

max,  ||ujn+1  —  Ujn||2 

- II  n+1112 -  <  V, 

max,  ||Ujr,+i||/ 

where  q  represents  a  given  small  positive  constant. 

The  algorithm  is  outlined  in  Figure  2.  While  other  operator  splitting  methods  have  been  studied 
for  minimization  problems  (e.g.  [73]),  ours  has  the  following  advantages:  (i)  it  is  direct  (i.e.  it 
does  not  require  the  solution  of  further  minimization  problems),  (ii)  the  resolution  can  be  adjusted 
by  increasing  the  number  of  eigenvectors  Ne  used  in  the  representation  of  the  phase  field,  and 
(iii)  it  has  low  complexity.  To  see  this  final  point,  observe  that  each  iteration  of  the  multiclass 
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GL  algorithm  has  only  0(NoKNe )  operations  for  the  main  loop,  since  matrix  Z  in  Figure  2  only 
has  dimensions  Ne  x  K,  and  then  (){NDK  log  K)  operations  for  the  projection  to  the  simplex. 
Usually,  Ne  <U  ND  and  K  <U  ND,  so  the  dominant  factor  is  simply  the  size  of  the  data  set  ND.  In 
addition,  it  is  generally  the  case  that  the  number  of  iterations  required  for  convergence  is  moderate 
(around  50  iterations).  Thus,  practically  speaking,  the  complexity  of  the  algorithm  is  linear. 

Note  on  Previous  Work  on  Multiclass  Classification 

Not  all  the  methods  deal  directly  with  the  multiple  classes  in  the  data  set.  A  different  approach 
is  to  reduce  the  multiclass  case  to  a  series  of  two-class  problems  and  to  combine  the  sequence 
of  resulting  sub-classifications.  Strategies  employed  include  recursive  partitioning,  hierarchical 
classification  and  binary  encodings,  among  others.  For  example,  Dietterich  and  Bakiri  use  a  binary 
approach  to  encode  the  class  labels  [31].  In  [61],  a  pairwise  coupling  is  described,  in  which  each 
two-class  problem  is  solved  and  then  a  class  decision  is  made  combining  the  decisions  of  all  the 
subproblems.  Szlam  and  Bresson  present  a  method  involving  Cheeger  cuts  and  split  Bregman 
iteration  [55]  to  build  a  recursive  partitioning  scheme  in  which  the  data  set  is  repeatedly  divided 
until  the  desired  number  of  classes  is  reached.  The  latter  scheme  has  been  extended  to  mutliclass 
versions.  In  [15],  a  multiclass  algorithm  for  the  transductive  learning  problem  in  high-dimensional 
data  classification,  based  on  i1  relaxation  of  the  Cheeger  cut  and  the  piecewise  constant  Mumford- 
Shah  or  Potts  models,  is  described. 

Our  methods,  on  the  other  hand,  have  roots  in  the  continuous  setting  as  they  are  derived  via 
a  variational  formulation.  Alternative  variational  principles  have  also  been  used  for  image  seg¬ 
mentation.  In  [73],  a  multiclass  labeling  for  image  analysis  is  carried  out  by  a  multidimensional 
total  variation  formulation  involving  a  simplex-constrained  convex  optimization.  In  that  work,  a 
discretization  of  the  resulting  PDEs  is  used  to  solve  numerically  the  minimization  of  the  energy. 
A  convex  relaxation  procedure  is  proposed  and  applied  to  image  segmentation.  In  these  cases,  the 
discretization  corresponds  to  a  uniform  grid  embedded  in  the  Euclidean  space  where  the  domain  re¬ 
sides.  Similarly,  diffuse  interface  methods  have  been  used  successfully  in  image  impainting  [8,32] 
and  image  segmentation  [39]. 

While  our  algorithms  are  inspired  by  continuous  processes,  they  can  be  written  directly  in  a 
discrete  combinatorial  setting  defined  by  the  graph  Laplacian.  This  has  the  advantage,  noted  by 
Grady  [57],  of  avoiding  errors  that  could  arise  from  a  discretization  process.  We  represent  the 
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data  as  nodes  in  a  weighted  graph,  with  each  edge  assigned  a  measure  of  similarity  between  the 
vertices  it  is  connecting.  The  edges  between  nodes  in  the  graph  are  not  the  result  of  a  regular  grid 
embedded  in  an  Euclidean  space.  Therefore,  a  nonlocal  calculus  formulation  [52]  is  the  tool  used  to 
generalize  the  continuous  formulation  to  a  (nonlocal)  discrete  setting  given  by  functions  on  graphs. 
Other  nonlocal  formulations  for  weighted  graphs  are  included  in  [36],  while  [58]  constitutes  a 
comprehensive  reference  about  techniques  to  cast  continuous  PDEs  in  graph  form.The  approach 
of  defining  functions  with  domains  corresponding  to  nodes  in  a  graph  has  successfully  been  used 
in  areas,  such  as  spectral  graph  theory  [24, 84]. 

As  pointed  out  in  [9],  there  are  interesting  connections  between  the  GL  functional  on  graphs  and 
normalized  graph  cuts.  Shi  and  Malik  [96]  pose  the  problem  of  image  segmentation  as  the  solution 
of  a  generalized  eigensystem  generated  from  a  graph  Laplacian.  In  [10],  graph  cuts  are  used 
to  efficiently  find  local  minima  of  a  wide  class  of  energies  with  various  smoothness  constraints 
for  multiclass  image  restoration.  Also,  as  mentioned  earlier,  the  method  in  [100]  is  a  recursive 
graph-based  partition  scheme.  A  multiclass  algorithm  for  the  transductive  learning  problem  in 
high-dimensional  data  classification,  described  in  [15],  is  based  on  i1  relaxation  of  the  Cheeger 
cut  and  the  piecewise  constant  Mumford-Shah  or  Potts  models.  In  [12],  rigorous  convergence 
results  are  presented  for  two  algorithms  that  solve  the  relaxed  Cheeger  cut  minimization  used  for 
unsupervised  data  clustering  are  presented.  Our  proposed  methods  are  related  to  some  of  these 
approaches,  but  use  the  graph  Ginzburg-Landau  functional  framework. 

In  the  continuous  setting,  it  can  be  shown  that  the  GL  is  a  diffuse  interface  approximation  to  the 
total  variational  functional  [33,70],  and  analogous  results  have  recently  been  proved  in  the  graph 
setting  as  well  [103].  This  function  is  a  natural  framework  for  producing  smooth  labels  everywhere 
while  preserving  sharp  discontinuities,  with  the  sharpness  controlled  by  a  diffuse  interface  param¬ 
eter.  The  advantage  of  the  diffuse  interface  model  is  that  the  energy  functional  is  more  tractable, 
and  can  be  minimized  by  simpler  numerical  methods. 

3.4  Application  to  Image  Processing 

Below,  we  show  the  image  processing  results.  To  embed  an  image  into  a  graphical  framework, 
we  consider  each  pixel  as  a  vertex. 
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Application  to  Image  Labeling 

We  applied  our  algorithm  to  segment  objects  in  images  of  cows  from  the  Microsoft  image 
database.  The  goal  was  image  labeling,  where  two  images  are  inputted  into  the  algorithm,  one  of 
which  has  been  hand- segmented  (partially  or  completely)  into  classes.  The  algorithm  segments  the 
second  image  based  on  the  segmentation  of  the  first. 

Binary  Image  Labeling  A  fully  connected  graph  is  constructed,  and  the  entries  in  the 
weight  matrix  are  calculated  using  feature  vectors.  Every  pixel  in  the  image  is  assigned  a  feature 
vector  consisting  of  intensity  values  of  pixels  in  its  neighborhood,  which  was  of  size  7  x  7  in 

d(x,y)2 

our  tests.  We  use  the  formula  w(x,y)  =  e  x2  ,  where  d(x,y )  is  the  weighted  2-norm  of  the 
difference  of  the  feature  vectors  of  pixels  x  and  y,  and  we  add  along  the  three  RGB  channels  of 
the  image.  The  weighted  2-norm  modifies  the  components  of  the  entered  vector  by  giving  more 
weight  to  the  pixels  close  to  the  original  pixel.  We  use  a  linearly  decreasing  kernel.  This  construc¬ 
tion  can  be  used  to  segment  different  types  of  objects  using,  for  example,  their  color  and  texture 
features.  Note  that  the  weight  function  can  be  modified  according  to  the  image.  For  example,  a 
weight  function  calculated  using  the  spectral  angle  may  be  more  effective  in  the  segmentation  of 
hyperspectral  images. 

To  obtain  eigenvalues  and  eigenvectors  of  Ls,  the  Nystrom  extension  method  is  used,  since  the 
size  of  the  graph  is  large.  For  the  problem,  the  fidelity  term  is  the  hand-labeled  image,  and  we 
initialize  u  to  be  the  class  number  if  it’s  known  and  a  middle  value  otherwise. 

The  results  are  displayed  in  Figure  3,  where  it  is  shown  that  our  algorithm  is  robust  to  mislabel¬ 
ing  in  the  hand  labeled  image.  To  transfer  the  label  for  the  grass,  cows  and  sky,  our  method  needed 
about  29,  29,  27  seconds,  respectively.  The  number  of  iterations  in  the  minimization  procedure 
(step  4  of  the  algorithm)  and  minimization  time  as  compared  to  the  method  in  [9]  are  displayed  in 
Table  1.  The  calculations  show  that  our  method  significantly  reduces  both. 

Multiclass  Image  Labeling  We  also  conducted  the  image  labeling  task  using  multiple  classes . 
The  results  are  shown  in  Figure  4.  The  weight  matrix  is  constructed  similarly  to  the  way  of  the 
previous  section,  with  the  neighborhood  of  size  5x5.  However,  here  we  use  the  weight  funtion 
(2)  and  create  a  sparse  graph.  A  local  scaling  graph  with  M  =  30  is  constructed.  For  the  fidelity 
term,  2.6%  of  labeled  pixels  are  used. 

The  multiclass  Ginzburg-Fandau  method  used  the  following  parameters:  30  eigenvectors,  e  = 
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Minimization  time 

Minimization  time 

in  method  in  [9] 

in  our  method 

grass  label 

8  s 

3.5  s 

cow  label 

18  s 

3.5  s 

sky  label 

6  s 

1.8  s 

#  of  iterations 

#  of  iterations 

in  method  in  [9] 

in  our  method 

grass  label 

130 

22 

cow  label 

274 

29 

sky  label 

84 

11 

Table  1 :  Comparison  of  Minimization  Time  and  Number  of  Iterations  of  the  MBO  method  and  [9] 

1,  dt  —  0.1,  n  =  50  and  q  =  Kb  7.  The  average  time  for  segmentation  using  different  fidelity 
sets  was  19.9  s.  The  multiclass  MBO  method  used  the  following  parameters:  30  eigenvectors, 
dt  —  0.1,  fi  —  50,  q  =  10  7.  The  average  time  for  segmentation  over  10  runs  was  around  1.2  s. 

Application  to  Image  Segmentation 

Grayscale  Image  Segmentation  We  tested  our  algorithms  on  the  image  of  figures  shown  in 
Figure  5a.  This  is  a  191  x  196  pixel  grayscale  image,  to  be  divided  into  five  classes.  To  construct 
the  weight  matrix,  we  use  feature  vectors  corresponding  to  a  pixel’s  ^-coordinate,  y-coordinate, 
and  intensity.  A  local  scaling  graph  with  M  =  30  is  constructed.  For  the  fidelity  term,  1,500  or 
4%  of  the  points  were  randomly  chosen. 

The  original  image  as  well  as  the  segmentation  results  are  included  in  Figure  5.  In  each  seg¬ 
mentation,  the  white  regions  correspond  to  the  identified  class.  The  multiclass  Ginzburg-Landau 
method  used  the  following  parameters:  10  eigenvectors,  e  =  1,  dt  =  0.5,  fi  —  50  and  q  =  10~7. 
It  was  able  to  segment  the  classes  perfectly,  with  an  average  time  of  4.1  s.  The  multiclass  MBO 
method  used  the  following  parameters:  10  eigenvectors,  dt  =  0.5,  //  =  50,  q  =  10'  7.  The 
algorithm  was  able  to  segment  all  the  points  correctly  in  2  iterations  and  0.232  s. 

We  compare  our  result  to  the  method  of  Li  and  Kim  [76],  which  also  segments  the  image 
perfectly.  However,  their  method  requires  additional  information,  such  as  the  densities  of  each 
class,  that  we  do  not  need  in  our  method. 

Multiclass  Image  Segmentation  We  then  tested  our  algorithms  on  the  color  image  of  cows 
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(a)  Original  Labeled  Image 


(c)  Regions  with  Grass  Label 


(e)  Regions  with  Cow  Label 


(g)  Regions  with  Sky  Label 


(b)  Unlabeled  Image 


(d)  Grass  Label  Transferred 


-1 

a  ~\u" 

M 

(f)  Cow  Label  Transferred 


(h)  Sky  Label  Transferred 


Figure  3:  Image  Labeling  Example  1 


shown  in  Figure  6a.  This  is  a  213  x  320  color  image,  to  be  divided  into  four  classes:  sky,  grass, 
black  cow  and  red  cow.  The  graph  is  constructed  in  the  same  way  as  in  the  multiclass  image 
labeling  example.  For  the  fidelity  term,  2.6%  of  labeled  pixels  are  used. 

The  multiclass  GL  method  used  the  following  parameters:  30  eigenvectors,  e  =  1,  dt  =  0.1, 
/i  =  50  and  //  =  10-' .  The  average  time  for  segmentation  using  different  fidelity  sets  was  19.9  s. 
The  multiclass  MBO  method  used  the  following  parameters:  30  eigenvectors,  dt  —  0.1,  n  —  50, 
//  =  HE  7.  The  average  time  for  segmentation  was  around  1.2  s,  and  there  were  11  iterations. 

One  of  the  results  of  each  of  our  two  methods  (using  the  same  fidelity  set)  is  depicted  in  Fig¬ 
ure  6.  It  can  be  seen  that  both  methods  are  able  to  identify  all  the  classes. 


36 


(c)  Image  to  Segment  (d)  Multiclass  GL  (e)  Multiclass  MBO 

Figure  4:  Image  Labeling  Example  2 


(e)  Class  4 


(c) Class  2 


1 


(f)  Class  5 


Figure  5:  Image  Segmentation  Example  1 


Application  to  Image  Inpainting 

The  problem  of  fitting  information  in  the  missing  pixels  of  an  image  is  an  important  inverse 
problem  in  image  processing  with  various  applications.  Obviously,  the  goal  is  to  produce  a  modi¬ 
fied  image  that  will  look  natural  to  an  observer.  The  problem  of  inpainting  may  also  be  seen  as  the 
problem  of  removing  occlusive  objects  from  an  image.  Sparse  reconstruction  refers  to  the  problem 
of  recovering  randomly  distributed  missing  pixels. 

There  are  numerous  approaches  to  solve  these  problems  in  the  current  literature.  Local  TV 
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(a)  Original  Image 


(b)  Black  Cow:  multiclass  GL 


(d)  Red  Cow:  multiclass  GL 


(f)  Grass:  multiclass  GL 


(h)  Sky:  multiclass  GL 


(c)  Black  Cow:  multiclass  MBO 


r 


(e)  Red  Cow:  multiclass  MBO 


(g)  Grass:  multiclass  MBO 


(i)  Sky:  multiclass  MBO 


Figure  6:  Image  Segmentation  Example  2 


methods  became  state-of-the-art  techniques  for  image  impainting.  However,  since  they  do  not 
perform  well  on  images  with  high  texture,  methods  that  decompose  images  into  cartoon  and  texture 
and  simultaneously  inpaint  both  are  developed  [7, 94].  The  problem  is  also  solved  with  nonlocal 
inpainting  methods.  We  are  particularly  interested  in  the  nonlocal  inpainting  algorithm  from  [5 1]  as 
we  develop  a  computationally  efficient  nonlocal  method.  Some  very  successful  nonlocal  methods 
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for  inpainting  and  sparse  reconstruction  are  given  in  [3]  and  [42].  Recently,  the  class  of  methods 
that  use  dictionaries  of  small  patches  that  commonly  appear  in  natural  images  became  increasingly 
popular.  Those  methods,  besides  inpainting,  are  also  successful  in  denoising  as  shown  in  [78].  In 
addition,  a  method  for  image  inpainting  using  Navier-Stokes  fluid  dynamics  is  proposed  in  [9]. 
The  authors  use  Navier-Stokes  dynamics  to  propagate  isophotes  into  the  inpainting  region,  thus 
simulating  the  way  painting  restoration  is  done.  Wavelets  and  framelets  are  also  successfully 
applied  to  solve  inpainting  problems  [19,32]. 

We  modify  our  segmentation  algorithm  slightly  for  the  purpose  of  image  inpainting. 

Binary  Image  Inpainting  Although  the  key  steps  of  the  segmentation  algorithm  remain  the 
same  when  it  is  modified  for  image  inpainting,  there  are  differences  to  be  noted.  For  example,  if 
a  damaged  image  is  used  to  construct  the  adjacency  matrix  W,  the  results  might  not  be  accurate, 
so  we  first  apply  a  fast  and  simple  H1  inpainting  algorithm  on  the  image  and  then  use  the  result  to 
create  W.  The  II'  inpainting  algorithm  we  apply  is  just  the  local  minimization  problem: 


mm 

U 


\Wu\2dx+  /  X(x)(u  —  uo)2dx. 


The  latter  term  in  the  sum  is  the  fidelity  term,  and,  of  course,  we  are  not  limited  to  this  formulation 
of  it.  Although  the  latter  algorithm  is  fast,  it  does  not  perform  well  on  images  with  high  textures 
and  repetitive  structures  nor  does  it  preserve  edges  [46],  something  that  we  achieve. 

The  matrix  W  is  built  by  using  a  window  of  a  certain  size  around  each  pixel.  We  set  W (x,  y )  = 
0  for  all  pixels  j  that  are  not  in  the  window  of  pixel  i.  Inside  the  window,  W(x,y)  =  w(x,y), 
where  the  weight  function  is  calculated  in  the  same  way  as  for  binary  image  labeling.  No  updating 
of  the  matrix  W  is  necessary.  The  Rayleigh-Chebyshev  procedure  is  used  to  calculate  the  eigen¬ 
vectors  and  eigenvalues  of  the  graph  Laplacian  for  binary  inpainting.  The  fidelity  region  is  the 
non-damaged  region.  We  initialize  u  to  be  the  middle  value  for  the  non-fidelity  points. 

Grayscale  Image  Inpainting  To  generalize  to  graycale  inpaining,  we  split  the  signal  bit-wise 
into  channels,  as  in  [32]: 

K- 1 

u(x)  =  y^mQg)  2m, 

m= 0 

where  urn  denotes  the  mth  digit  in  the  binary  representation  of  the  signal,  and  urn  e  {0,1}  for  Mx. 

A  fully  connected  graph  is  created  in  the  same  way  as  in  the  binary  inpainting  case.  The 
Nystrom  extension  method  is  used  to  calculate  the  eigenvalues  and  corresponding  eigenvectors 
since  the  size  of  the  graph  is  very  large.  The  fidelity  region  and  initialization  is  the  same  as  in  the 
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binary  inpainting  case. 

Updating  the  weight  matrix  is  often  necessary  for  grayscale  inpainting,  since  the  adjacency 
matrix  formed  from  the  damaged  image  is  usually  not  good  enough  to  restore  texture  and  complex 
patterns,  as  it  contains  “bad”  regions  whose  values  lie  far  from  the  true  value.  In  our  tests,  every 
few  iterations,  the  matrix  is  updated  using  the  result  from  the  last  iteration  as  the  “new  image”. 

Binary  Image  Inpainting  Results  The  binary  image  inpainting  results  and  their  PSNR  are 
displayed  in  Figure  7.  In  both  cases,  the  algorithm  was  able  to  recover  the  texture  and  repetitive 
structure  present  in  the  image. 

Grayscale  Image  Inpainting  Results  The  grayscale  inpainting  results  along  with  their  PSNR 
are  displayed  in  Figures  8-13.  In  all  cases,  repetitive  structure  and  texture  were  recovered. 

We  compare  our  results  to  local  and  nonlocal  TV  inpainting.  Local  TV  inpainting  fails  to  re¬ 
cover  texture  and  repetitive  structure.  While  the  results  of  nonlocal  TV  inpainting  are  comparable 
to  those  of  our  method,  our  method  is  more  efficient.  Timing  results  are  displayed  in  Table  2,  and 
we  see  that  our  method  is  several  times  faster.  We  also  show  our  method  and  nonlocal  TV  inpaint¬ 
ing  at  certain  iterations  in  Figure  14.  To  implement  the  nonlocal  TV  inpainting  algorithm,  we  used 
the  Bregmanized  version  detailed  in  [112]  and  modified  it  for  inpainting.  The  stopping  condition 
was  the  same  as  in  our  inpainting  algorithm,  and  a  quick  H 1  inpainting  algorithm  was  run  on  the 
image  before  the  weights  were  calculated. 


Total  time  for 

nonlocal  TV 

Total  time  for 

our  method 

chessboard-like  pattern 

266  s 

48  s 

text  inpainting 

410  s 

67  s 

small  rectangle  inpainting 

1882  s 

443  s 

large  rectangle  inpainting 

3397  s 

832  s 

50%  inpainting 

1402  s 

333  s 

Table  2:  Comparison  of  Runtime  of  the  MBO  method  and  that  of  Nonlocal  TV 
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(d)  original  image-  stripes 


(e)  damaged  image-  stripes 


(f)  our  method’s  result-  PSNR  25.0687 


Figure  7:  Binary  Inpainting  Example.  For  the  Barbara  image,  the  simulation  took  113  seconds,  and  there  were  6 
iterations  in  the  two-step  scheme.  We  used  C\  =  700,  dt  =  0.003,  er  =  45,  31  x  31  neighborhood  for  feature  vector 
calculation,  21  x  21  window  and  calculated  400  eigenvectors.  For  the  image  of  stripes,  the  simulation  took  66  seconds, 
and  there  were  4  iterations  in  the  two-step  scheme.  We  used  C i  =  700,  dt  =  0.002,  a  =  45,  17  x  17  neighborhood 
for  feature  vector  calculation,  21  x  21  window  and  calculated  200  eigenvectors. 
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Figure  8:  Grayscale  Inpainting  Example.  The  simulation  took  48  seconds,  and  there  were  2  iterations  in  the  two-step 
scheme.  We  used  C\  =  700,  dt  =  0.005,  a  =  20,  41  x  41  neighborhood  for  feature  vector  calculation,  and  calculated 
600  eigenvectors.  No  updating  of  W  was  necessary.  The  nonlocal  inpainting  took  266  seconds. 
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(a)  original  image-  Barbara 


(b)  damaged  image-  Barbara 


(c)  local  TV  inpainting-  PSNR  29.1508 


(d)  nonlocal  TV  inpainting-  PSNR  35.6896  (e)  our  method’s  result-  PSNR  34.0688 


Figure  9:  Text  Inpainting  Example.  The  simulation  took  67  seconds,  and  there  were  4  iterations  in  the  two-step 
scheme.  We  used  C\  =  700,  dt  =  0.005,  er  =  5,  21  x  21  neighborhood  for  feature  vector  calculation,  and  calculated 
500  eigenvectors.  We  update  W  every  other  iteration.  The  nonlocal  inpainting  took  410  seconds. 
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(d)  nonlocal  TV  inpainting-  PSNR  44.1469  (e)  our  method’s  result-  PSNR  41.2848 

Figure  10:  Region  Inpainting  Example  1.  The  simulation  took  443  seconds,  and  there  were  13  iterations  in  the  two-step 
scheme.  We  used  C\  =  700,  dt  =  0.01,  a  =  4,  31  x  31  neighborhood  for  feature  vector  calculation,  and  calculated 
500  eigenvectors.  We  update  W  every  iteration.  The  nonlocal  TV  inpainting  took  1882  seconds. 


(d)  nonlocal  TV  inpainting-  PSNR  35.0663  (e)  our  method’s  result-  PSNR  37.0315 


Figure  11:  Region  Inpainting  Example  2.  The  simulation  took  832  seconds,  and  there  were  13  iterations  in  the  two- 
step  scheme.  We  used  C\  =  700,  dt  =  0.014,  er  =  4,  45  x  45  neighborhood  for  feature  vector  calculation,  and 


calculated  500  eigenvectors.  We  update  W  every  iteration.  The  nonlocal  inpainting  took  3397  seconds. 


(d)  nonlocal  TV  inpainting-  PSNR  27.8196  (e)  our  method’s  result-  PSNR  27.1651 


Figure  12:  50%  Reconstruction  Example.  The  simulation  took  333  seconds,  and  there  were  50  iterations  in  the  two- 
step  scheme.  We  used  C\  =  700,  dt  =  0.005,  a  =  4,  7  x  7  neighborhood  for  feature  vector  calculation,  and  calculated 
400  eigenvectors.  We  update  W  every  iteration.  The  nonlocal  inpainting  took  1402  seconds. 
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(a)  damaged  image-  35%  of  the  pixels  removed  (b)  local  TV  inpainting-  PSNR  22.6530 


(c)  our  method’s  result-  PSNR  24.1266 


Figure  13:  65%  Reconstruction  Example 


(a)  nonlocal  TV- after  2  iter.- PSNR  (b)  nonlocal  TV- after  5  iter.- PSNR  (c)  nonlocal  TV- after  8  iter.- PSNR  (d)  nonlocal  TV-  after  13  iter. 


PSNR  35.0663 


(e)  our  method-  after  2  iter.-  PSNR  (f)  our  method-  after  5  iter.-  PSNR  (g)  our  method-  after  8  iter.-  PSNR  (h)  our  method-  after  13  iter.- 
30.4406  31.8993  34.4851  PSNR  37.0315 


Figure  14:  Visualization  of  Inpainting  at  Different  Iterations 


з. 5  Application  to  Classification 

We  have  applied  our  algorithm  to  both  the  binary  and  multiclass  classification  problem  and 
tested  it  on  benchmark  data  sets. 

Application  to  Binary  Classification 

Two  Moons  Data  Set  This  data  set  was  used  in  [18]  in  relation  to  spectral  clusering.  It  is 
constructed  from  the  following  two  half  circles  in  M2  with  radius  one.  The  first  half  circle  is 
centered  at  the  origin  and  is  in  the  upper  half  plane.  The  second  half  circle  is  formed  by  taking 
the  lower  half  of  the  circle  centered  at  (1,  0.5).  A  thousand  points  are  chosen  uniformly  from  each 
of  the  two  half  circles.  The  two  thousand  points  are  then  embedded  in  M100,  and  i.d.d.  Gaussian 
noise  with  standard  deviation  0.02  is  added  to  each  coordinate.  The  goal  is  to  segment  those  two 
half  circles. 

To  create  the  graph,  we  used  the  weight  function  (2),  and  thus  created  a  sparce  graph.  To 
calculate  the  eigenvectors,  the  Rayleigh-Chebyshev  procedure  [1]  is  used.  There  is  no  fidelity 
term,  but  we  use  a  zero  mass  constraint  due  to  the  equal  size  of  the  classes.  For  initialization  of 

и,  we  use  the  sign  of  the  second  eigenvector  of  the  symmetric  Laplacian  after  the  mean  constraint 
has  been  applied  to  it  (see  Section  15.4). 

We  compared  our  results  to  the  method  of  Bertozzi  and  Flenner  in  [9]  by  running  simulations 
on  35  different  randomly  generated  two  moons  data  sets.  The  average  accuracy  was  96.0520% 
and  96.0460%  for  our  method  and  the  method  in  [9],  respectively.  However,  40  iterations  in  the 
minimization  procedure  were  used,  compared  to  300  needed  using  the  method  in  [9].  Therefore, 
our  method  resulted  in  a  significant  decrease  in  the  number  of  iterations. 

We  also  compared  our  results  to  a  spectral  clustering  method  of  thresholding  the  second  eigen¬ 
vector  of  Ls.  The  results  are  displayed  in  Figure  15.  Clearly,  clustering  using  the  second  eigenvec¬ 
tor  does  not  result  in  an  accurate  segmentation. 

House  Voting  Records  Data  Set  We  applied  our  algorithm  to  the  US  House  of  Representa¬ 
tives  voting  records  data  set,  which  consists  of  16  different  votes  from  each  of  the  435  individuals. 
The  goal  was  to  assign  each  individual  to  either  the  Republican  or  the  Democrat  party  using  the 
prior  knowledge  of  the  party  affiliation  of  only  five  individuals,  two  Democrats  and  three  Republi¬ 
cans.  The  votes  were  taken  in  1984  from  the  98th  United  States  Congress,  2nd  session. 
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(a)  second  eigenvector  segmentation-  83.75%  (b)  our  method’s  segmentation-  97.7% 

Figure  15:  Two  Moons  Data  Set  Results.  Segmentation  by  thresholding  the  second  eigenvector  and  our  method, 
respectively.  The  four  parameters  s  (in  step  IV  of  our  algorithm),  number  of  eigenvectors,  dt,  and  M  (parameter  in 
the  Zelnik-Manor  and  Perona  weight  function)  are  set  to  3,  25,  0.725  and  13,  respectively. 

A  weight  matrix  is  constructed  using  calculations  involving  feature  vectors.  A  16-dimensional 
feature  vector  is  assigned  to  each  individual  consisting  of  his/her  16  votes.  A  “yes”  vote  is  set  to 
1,  a  “no”  vote  is  set  to  —1,  while  a  “did  not  vote”  recording  is  set  to  0.  The  weight  function  used 
is  (1).  The  graph  is  made  sparse  by  setting  W(x,y)  equal  to  zero  if  point  y  is  not  among  the  Mth 
closest  points  to  point  x.  To  calculate  the  eigenvectors,  a  SVD  solver  is  used.  We  initialize  u  to 
the  middle  value  for  non-fidelity  points.  The  fidelity  term  consisted  of  two  Democracts  and  three 
Republicans.  The  three  Republicans  were  chosen  to  be  the  first,  second  and  eighth  person  in  the 
list.  The  Democrats  were  chosen  to  be  the  third  and  fourth  person  in  the  list.  The  parameters  C\ 
(fidelity  term  parameter),  s  (in  step  IV  of  our  algorithm),  number  of  eigenvectors,  dt,  a  and  M  are 
set  to  9.25,  3,  45,  4.675,  a/5  and  10,  respectively. 

We  obtained  an  accuracy  of  94.023%.  Only  5  iterations  in  the  minimization  procedure  were 
needed  compared  to  450  iterations  needed  by  the  method  in  [9].  Some  of  the  votes  predicted  the 
party  affiliation  very  well,  i.e.  above  85%.  We  investigated  the  accuracy  of  our  algorithm  when 
these  votes  were  removed.  With  top  two,  top  six  and  top  eight  most  predictive  votes  removed,  our 
method  obtained  an  accuracy  of  90.1149%,  88.34448%  and  81.1494%,  resp..  The  order  of  the  top 
eight  predictive  votes  from  the  most  predictive  to  least  predictive  is  vote  4,  14,  1,  2, 15,  6,  3  and  8. 
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Application  to  Multiclass  Classification 

For  multiclass  classification,  we  have  tested  our  algorithms  on  several  benchmark  data  sets.  In 
all  cases,  we  compute  the  graph  by  (2),  and  make  it  sparse  by  only  connecting  vertices  that  are 
near  enough  to  each  other. 

All  the  results  and  comparisons  with  other  published  methods  are  summarized  in  Tables  14  and 
4.  Due  to  the  arbitrary  selection  of  the  fidelity  points,  our  reported  values  correspond  to  averages 
obtained  over  10  runs  with  different  random  selections.  The  timing  results  and  number  of  iterations 
of  the  two  methods  are  shown  in  Tables  5  and  6,  respectively.  The  methods  are  labeled  as  “multi¬ 
class  GL”  and  “multiclass  MBO”.  These  comparisons  show  that  our  methods  exhibit  a  performance 
that  is  competitive  with  or  better  than  the  current  state-of-the-art  segmentation  algorithms. 

Parameters  are  chosen  to  be  compatible  between  the  methods.  For  the  multiclass  GL  method, 
the  convexity  constant  used  is:  C  —  n  +  K  This  is  the  minimum  constant  that  guarantees  the  con¬ 
vexity  and  concavity  of  the  terms  in  the  energy  partition  of  the  convex  splitting  strategy  employed. 
For  the  multiclass  MBO  method,  as  discussed  in  the  previous  section,  the  diffusion  step  can  be  re¬ 
peated  a  number  of  times  before  thresholding.  In  all  of  our  results,  we  run  the  diffusion  step  three 
times  before  any  thresholding  is  done  (Ns  =  3).  To  compute  the  eigenvectors  and  eigenvalues  of 
the  symmetric  graph  Laplacian,  we  use  fast  numerical  solvers.  We  use  the  Rayleigh-Chebyshev 
procedure  of  [1]  for  computing  all  the  eigendecompositions. 

All  the  results  reported  point  out  that  both  multiclass  GL  and  multiclass  MBO  perform  well  in 
terms  of  data  segmentation  accuracy.  While  the  ability  to  tune  multiclass  GL  can  be  an  advantage, 
multiclass  MBO  is  simpler  and,  in  our  examples,  displays  even  better  performance  in  terms  of 
its  greater  accuracy  and  tiny  number  of  iterations  required.  The  relative  strength  and  speed  of 
multiclass  MBO  may  not  always  hold,  but  the  avoidance  of  a  nonconvex  functional  minimization 
that  takes  place  in  multiclass  GL  may  explain  the  accuracy  and  speed  increase.  Exploring  the 
underlying  connections  of  the  energy  evolution  of  these  methods  and  the  energy  landscape  for  the 
relaxed  Cheeger  cut  minimization  recently  established  in  [12]  are  to  be  explored  in  future  work. 

Synthetic  Data  The  synthetic  data  set  we  used  is  the  three  moons  data  set.  It  is  constructed 
by  generating  three  half  circles  in  M2.  The  two  half  top  circles  are  unit  circles  with  centers  at  (0,  0) 
and  (3,  0).  The  bottom  half  circle  has  radius  1.5  and  the  center  at  (1.5,  0.4).  Five  hundred  points 
from  each  of  those  three  half  circles  are  sampled  and  embedded  in  M 1 00  by  adding  Gaussian  noise 
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with  standard  deviation  of  0.14  to  each  of  the  100  components  of  each  point.  The  dimensionality 
of  the  data  set  and  the  noise  makes  segmentation  a  significant  challenge. 

The  weight  matrix  of  the  graph  edges  was  calculated  using  local  scaling  based  on  the  17th 
closest  point  (M  =  17).  The  fidelity  term  was  constructed  by  labeling  25  points  per  class,  75  points 
in  total,  corresponding  to  only  5%  of  the  points  in  the  data  set.  The  multiclass  GL  method  used  the 
following  parameters:  20  eigenvectors,  e  =  1,  dt  =  0.1,  //  =  30,  //  =  10-' .  The  method  was  able 
to  produce  an  average  of  98.4%  of  correct  classification,  with  a  corresponding  computation  time 
of  0.16  s  per  run  on  a  2.4  GHz  Intel  Core  i2  Quad  without  any  parallel  processing.  The  multiclass 
MBO  method  used  the  following  parameters:  20  eigenvectors,  dt  =  0.1,  /i  =  30, 77  =  10~7.  It  was 
able  to  segment  an  average  of  99.12%  of  the  points  correctly  over  10  runs  with  only  3  iterations 
and  about  0.01  s  of  computation  time.  One  of  the  results  obtained  is  shown  in  Figure  16. 

Table  14  gives  published  results  from  other  related  methods,  for  comparison.  Note  that  the 
results  for  p-Laplacians  [18],  Cheeger  cuts  [100]  and  binary  GL  are  for  the  simpler  binary  problem 
of  two  moons  (also  embedded  in  M100).  While,  strictly  speaking,  these  are  unsupervised  methods, 
they  all  incorporate  prior  knowledge  such  as  a  mass  balance  constraint.  We  therefore  consider 
them  comparable  to  our  SSL  approach.  The  “tree  GL”  method  [47]  uses  a  scalar  multiclass  GL 
approach  with  a  tree  metric.  It  can  be  seen  that  our  methods  achieve  the  highest  accuracy  on  this 
test  problem. 


Figure  16:  Three  Moons  Data  Set  Result.  Segmentation  of  three  moons  using  multiclass  MBO  (98.4667%  correct). 


MNIST  Data  Set  The  MNIST  data  set  [72]  is  composed  of  70,  000  28  x  28  images  of  hand¬ 
written  digits  0  through  9.  Examples  of  entries  can  be  found  in  Figure  17.  The  task  is  to  classify 
each  of  the  images  into  the  corresponding  digit.  The  images  include  digits  from  0  to  9;  thus,  this 
is  a  10  class  segmentation  problem. 

To  construct  the  weight  matrix,  we  used  the  local  scaling  based  on  the  8th  closest  neighbor 
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(M  =  8).  Note  that  we  perform  no  preprocessing,  i.e.  the  graph  is  constructed  using  the  28  x  28 
images.  For  the  fidelity  term,  250  images  per  class  (2500  images  corresponding  to  3.6%  of  the  data) 
are  chosen  randomly.  The  multiclass  GL  method  used  the  following  parameters:  300  eigenvectors, 
e  —  1,  dt  —  0.15,  [i  =  50  and  rj  =  10-'.  The  complete  set  of  70,000  images  was  segmented  with 
an  average  accuracy  of  96.8%  of  the  digits  classified  correctly  in  an  average  time  of  811  s.  The 
averages  are  obtained  over  10  runs.  The  confusion  matrix  for  the  best  result  obtained  is  included 
in  Table  7.  Most  of  the  mistakes  were  in  distinguishing  digits  4  and  9,  and  digits  5  and  8.  The 
multiclass  MBO  method  used  the  following  parameters:  300  eigenvectors,  dt  =  0.15,  /i  =  50, 
7]  =  HP  7.  The  algorithm  was  able  to  segment  an  average  of  96.91%  of  the  digits  correctly  over  10 
runs  in  only  4  iterations  and  15.382  s.  We  display  the  confusion  matrix  in  Table  8.  Note  that  most 
of  the  mistakes  were  in  distinguishing  digits  4  and  9,  and  digits  2  and  7. 

Table  14  compares  our  results  with  those  from  other  methods  in  the  literature.  As  with  the  three 
moon  problem,  some  of  these  are  based  on  unsupervised  methods  but  incorporate  enough  prior 
information  that  they  can  fairly  be  compared  with  SSL  methods.  The  methods  of  linear/nonlinear 
classifers,  A-ncarcst  neighbors,  boosted  stumps,  neural  and  convolutional  nets  and  SVM  are  all 
supervised  learning  approaches,  taking  60,000  of  the  digits  as  a  training  set  and  10,000  digits  as  a 
testing  set  [72],  in  comparison  to  our  SSL  approaches  where  we  take  only  3.6%  of  the  points  for 
the  fidelity  term.  Our  algorithms  are  nevertheless  competitive  with,  and  in  most  cases  outperform, 
these  supervised  methods.  Moreover,  we  perform  no  preprocessing  or  initial  feature  extraction 
on  the  image  data,  unlike  most  of  the  other  methods  we  compare  with  (we  did  exclude  from  the 
comparison,  however,  methods  that  explicitly  deskewed  the  image).  While  there  is  a  computational 
price  to  be  paid  in  forming  the  graph  when  data  points  use  all  784  pixels  as  features  (see  graph 
calculation  time  in  Table  5),  this  is  a  one-time  operation  that  conceptually  simplifies  our  approach. 
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Figure  17:  Examples  of  Digits  from  the  MNIST  Data  Set 

COIL  Data  Set  We  evaluated  our  performance  on  the  benchmark  COIL  data  set  [22,  88]. 
This  is  a  set  of  color  128  x  128  images  of  100  objects,  taken  at  different  angles.  The  red  channel  of 
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each  image  was  then  downsampled  to  16  x  16  pixels  by  averaging  over  blocks  of  8  x  8  pixels.  Then 
24  of  the  objects  were  randomly  selected  and  partitioned  into  six  classes.  Discarding  38  images 
from  each  class  leaves  250  per  class,  giving  a  data  set  of  1500  data  points. 

To  construct  the  weight  matrix,  we  used  a  local  scaling  based  on  the  Ath  closest  neighbor  (. M  = 
4).  The  fidelity  term  was  constructed  by  labeling  10%  of  the  points,  selected  at  random.  For 
multiclass  GL,  the  parameters  were:  50  eigenvectors,  e  =  1,  dt  =  0.2,  //  =  100  and  rj  =  10-7. 
This  resulted  in  91.4%  of  the  points  classified  correctly  (average)  in  2.3  s.  For  multiclass  MBO, 
the  parameters  were:  50  eigenvectors,  dt  =  0.2,  /i  =  100,  rj  =  10“7.  We  obtained  an  accuracy  of 
91.46%,  averaged  over  10  runs.  The  procedure  took  6  iterations  and  0.03  s. 

Comparative  results  reported  in  [98]  are  shown  in  Table  14.  These  are  all  SSL  methods  (with 
the  exception  of  /c-nearest  neighbors  which  is  supervised),  using  10%  fidelity  just  as  we  do.  Our 
results  are  of  comparable  or  greater  accuracy. 

WebKB  Data  Set  We  tested  our  methods  on  the  task  of  text  classification  on  the  WebKB  data 
set  [28].  This  is  a  collection  of  4199  webpages  from  Cornell,  Texas,  Washington  and  Wisconsin 
universities,  as  well  as  other  miscellaneous  pages  from  other  universities.  The  webpages  are  to  be 
divided  into  4  classes:  project,  course,  faculty  and  student.  The  data  set  is  preprocessed  as  in  [20]. 

To  construct  the  weight  matrix,  we  used  575  nearest  neighbors.  Tfidf  term  weighting  [20]  is 
used  to  represent  the  website  feature  vectors.  They  were  then  normalized  to  unitary  length.  The 
weight  matrix  points  are  calculated  using  cosine  similarity.  For  the  multiclass  GL  method,  the 
parameters  were:  250  eigenvectors,  e  =  1,  dt  —  1,  /x  =  50  and  rj  =  10~7.  The  average  accuracies 
obtained  are:  81.5%,  84.2%,  85.4%,  86.7%  and  87.3%  over  fidelity  sets  of  10%,  15%,  20%,  25% 
and  30%  of  the  points,  respectively.  The  average  computation  time  is  6.97  s.  For  the  multiclass 
MBO  method,  the  parameters  were:  250  eigenvectors,  dt  =  1,  fi  —  4,  rj  =  HP  7.  We  obtained 
average  accuracies  of:  83.71%,  85.75%,  86.81%,  87.74%  and  88.48%  over  fidelity  sets  of  10%, 
15%,  20%,  25%  and  30%  of  the  points,  resp..  The  procedure  took  0.05  s  and  7  iterations. 

We  compare  our  results  with  those  of  several  supervised  learning  methods  reported  in  [20], 
shown  in  Table  14.  For  these  methods,  two  thirds  of  the  data  was  used  for  training,  and  one  third  for 
testing.  Our  SSL  methods  obtain  higher  accuracy,  using  only  20%  fidelity  (for  multiclass  MBO). 
Note  also  that  a  larger  sample  of  points  for  the  fidelity  term  reduces  the  error  in  the  classification 
results,  as  shown  in  Table  4.  Nevertheless,  the  accuracy  is  high  even  for  the  smallest  fidelity  sets. 
Therefore,  the  methods  appear  quite  adequate  for  the  SSL  setting  where  only  a  few  labeled  data 
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points  are  known  beforehand. 

Swiss  Roll  Data  Set  The  Swiss  roll  data  set,  pictured  in  Figure  18a,  contains  1600  3 D  points 
arranged  in  spirals.  To  calculate  the  weight  matrix,  we  used  the  weight  function  in  [90]  with  10 
nearest  neighbors.  To  calculate  the  eigenvectors,  we  used  the  procedure  in  [1].  The  parameters 
used  were:  50  eigenvectors,  C  —  51,  e  =  1,  dt  —  0.1,  /x  =  50  and  =  10'7  with  80  fidelity  points 
(5%). 

Averaged  over  100  runs,  the  average  accuracies  obtained  were  94.1%  and  91.8%  for  the  mul¬ 
ticlass  GL  and  MBO  methods,  respectively.  The  visual  result  of  the  latter  method  is  included  in 
Figure  18c.  We  compare  this  result  to  the  one  obtained  using  spectral  clustering  (Figure  18b): 
average  accuracy  was  only  49.75%  over  100  runs,  with  the  graph  being  the  same  as  for  multiclass 
GL  and  MBO  case.  Calculations  were  done  with  4  eigenvectors. 


(a)  Swiss  roll  data  set  (b)  spectral  clustering  result  (49%)  (c)  our  method’s  result  (94.5%) 

Figure  18:  MBO  Method  Swiss  Roll  Data  Set  Results 

Landsat  Satellite  Data  Set  This  database  constains  multi-spectral  values  of  pixels  in  3  x  3 
neighborhoods  in  a  satellite  image,  and  the  classification  associated  with  the  central  pixel  in  the 
neighborhood.  The  goal  is  to  predict  this  classification,  using  the  multi-spectral  values.  The  data 
set  consists  of  6435  nodes,  each  representing  a  neighborhood  of  a  82  x  100  satellite  image. 

To  calculate  the  weight  matrix,  we  used  the  weight  function  in  [90]  with  30  nearest  neighbors. 
To  calculate  the  eigenvectors,  we  used  the  procedure  in  [1].  The  parameters  used  were:  200 
eigenvectors,  C  —  51,  e  =  1,  dt  —  0.1,  /i  =  50  and  rj  =  10' 7  with  350  fidelity  points  (5.44%). 

Averaged  over  30  runs,  the  average  accuracies  obtained  were  87.05%  and  87.25%  for  the  multi¬ 
class  GL  and  MBO  methods,  respectively.  We  compare  these  results  to  several  supervised  learning 
methods  listed  in  [86].  These  algorithms  were  performed  using  80%  training  and  20%  validation. 
The  results  were:  65.15%  using  SC-SVM,  75.43%  using  SH-SVM,  65.88%  using  S-LS,  86.65% 
using  simplex  boosting,  and  90.15%  using  S-LS  rbf.  We  outperform  all  but  the  last  algorithm  using 
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only  5%  fidelity,  while  these  algorithms  use  80%  for  training. 

Human  Activity  Data  Set  This  data  set  from  the  UCI  Machine  Learning  Repository  contains 
information  about  experiments  carried  out  with  volunteers.  Each  person  performed  one  of  six 
actions:  walking,  walking  upstairs,  walking  downstairs,  sitting,  standing  and  laying  while  wearing 
a  smartphone  on  the  waist.  The  smartphone  recorded  their  linear  acceleration  and  3-axial  angular 
velocity.  The  goal  is  to  segment  the  people  into  6  classes  according  to  activity  using  the  information 
obtained  from  the  phone.  The  data  set  has  10229  nodes.  We  used  the  weight  function  in  [90]  with 
59  nearest  neighbors.  The  parameters  used  were:  50  eigenvectors,  C  =  160,  dt  =  0.31,  //  =  159 
and  r/  =  10”'  with  5%  of  points  being  fidelity  points. 

The  average  accuracy  was  89.7%  for  the  multiclass  MBO  method,  while  being  88.7%  for  the 
GL  method.  We  compare  this  to  the  results  of  [2],  where  the  methods  MC-SVM  and  MC-HF- 
SVM  have  accuracies  of  89.3%  and  89.0%,  respectively.  It  is  important  to  note  that  the  results  of 
the  paper  were  obtained  using  supervised  learning  methods  where  70%  of  the  data  was  used  for 
training  and  the  rest  for  testing.  We  obtain  higher  accuracy  (with  multiclass  MBO)  using  only  5% 
fidelity. 

3.6  Application  to  Hyperspectral  Imagery 

We  consider  the  challenge  of  detection  of  chemical  plumes  in  hyperspectral  image  data.  Seg¬ 
mentation  of  gas  is  difficult  due  to  the  diffusive  nature  of  the  cloud.  The  use  of  hyperspectral 
imagery  provides  non-visual  data  for  this  problem,  allowing  for  the  utilization  of  a  richer  array 
of  sensing  information.  We  now  present  a  method  to  track  and  classify  objects  in  hyperspectral 
videos.  The  method  involves  the  application  of  a  new  algorithm  recently  developed  for  high  di¬ 
mensional  data.  It  is  made  efficient  by  the  application  of  spectral  methods  and  the  NystrA  "om 
extension  to  calculate  the  eigenvalues/eigenvectors  of  the  graph  Laplacian.  Results  are  shown  on 
plume  detection  in  LWIR  standoff  detection. 

Detecting  chemical  plumes  in  the  atmosphere  is  a  problem  that  can  be  applied  to  many  areas, 
such  as  defense,  security  and  environmental  protection.  If  the  airborne  toxins  are  identified  accu¬ 
rately,  one  can  combat  the  use  of  chemical  gases  as  weapons,  prevent  fatalities  due  to  accidental 
leakage  of  toxic  gases  and  avoid  contamination  of  the  atmosphere.  Identification  of  harmful  gases 
with  high  fidelity  is  needed  to  provide  warnings  in  threatening  situations.  In  these  grave  scenarios, 
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it  is  crucial  to  accurately  track  the  diffusion  of  dangerous  plumes  into  the  atmosphere.  Laboratory 
measured  signatures  of  toxic  chemicals  are  available  to  assist  in  chemical  plume  identification. 
However,  testing  and  training  data  is  not  readily  available  due  to  the  inherent  danger  of  these  real 
world  situations.  Instead,  open  air  testing  with  surrogate  chemicals  is  conducted  to  study  the  diffu¬ 
sion  of  chemical  plumes.  The  developed  plume  detection  methods  must  meet  strict  requirements 
to  ensure  the  fidelity  of  a  detector. 

We  propose  applying  the  method  outlined  in  [48]  to  hyperspectral  data,  in  particular,  to  track 
and  classify  chemical  plumes,  recorded  in  a  hyperspectral  video  sequence.  The  pixels  of  the  images 
in  the  video  are  considered  as  vertices  in  a  graph,  and  we  minimize  the  total  variation  with  fidelity 
to  known  data.  The  Nystrom  extension  method  is  used  to  efficiently  calculate  eigenfunctions  of 
the  graph  Laplacian.  They  are  then  used  both  for  operator  assisted  assignment  of  fidelity  values 
and  in  the  actual  total  variation  minimization  algorithm  itself.  The  paper  is  organized  as  follows: 
section  2  reviews  the  graph  representation  of  the  data  as  well  as  the  Nystrom  extension  method, 
section  3  presents  the  method  and  the  results,  and  the  section  4  contains  the  conclusions. 

We  consider  the  data  set,  described  in  more  detail  in  [16],  composed  of  video  sequences  record¬ 
ing  the  release  of  chemical  plumes  at  the  Dugway  Proving  Ground.  The  data  was  provided  by 
the  Applied  Physics  Laboratory  at  Johns  Hopkins  University.  The  images  are  of  dimension  128 
x  320  x  129,  where  the  last  dimension  indicates  the  number  of  channels,  each  depicting  a  par¬ 
ticular  frequency  from  7,820  nm  to  11,700  nm,  spaced  30  nm  apart.  The  sets  of  images  were 
taken  from  videos  captured  by  three  long  wave  infrared  (LWIR)  spectrometers,  each  placed  at  a 
different  location  about  2  km  away  from  the  release  of  plume  at  an  elevation  of  around  1300  feet. 
One  hyperspectral  image  is  captured  every  five  seconds.  This  data  set  has  been  studied  in  other 
works  such  as  [49],  [102],  [99].  Prior  work  on  hyper  spectral  plume  detection  using  other  sensors 
includes  [65]  (MWIR)  and  [79]  (HYDICE).  This  paper  is  the  first  example  of  the  new  graphical 
MBO  scheme  applied  to  standoff  detection  data.  The  results  are  excellent  compared  to  prior  work 
in  this  area. 

There  are  many  challenges  to  be  faced  when  tracking  chemical  plumes.  One  obstacle  faced  by 
the  authors  of  [49]  is  the  significant  preprocessing  needed  to  accurately  detect  the  plume.  Due  to 
the  noisy  structure  of  the  data  set,  principal  component  analysis  reduced  the  data  to  three  main 
features  used  to  produce  a  false  color  video  sequence  of  the  plume  release,  followed  by  midway 
equalization  to  smooth  the  flicker  between  frames.  Similar  preprocessing  is  used  in  [102],  which 
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(a)  2nd  Eigenvector 


(b)  3rd  Eigenvector 


(c)  4th  Eigenvector 


(d)  5  th  eigenvector 


Figure  19:  Eigenvectors  of  the  symmetric  normalized  graph  Laplacian  for  the  7  video  frames  shown  in  false  color. 

outlines  a  novel  plume  detection  method  involving  a  binary  partition  tree.  The  advantage  of  our 
method  is  that  it  does  not  require  any  preprocessing  of  the  hyperspectral  data;  we  use  the  raw 
data  organized  in  a  graph  setting.  Moreover,  as  pointed  out  in  [102],  the  ground  truth  data  is 
nonexistent,  since  surrogate  chemicals,  instead  of  the  toxic  ones,  are  used  in  testing.  This  makes 
the  assessment  of  the  results  somewhat  difficult.  The  authors  of  [49]  dealt  with  this  problem  by 
using  an  adaptive  matched  subspace  detector  (AMSD)  described  in  [79]  to  benchmark  their  spectral 
clustering  results.  AMSD  is  a  probabilistic  detection  scheme  that  uses  a  generalized  likelihood 
ratio  test  to  choose  between  two  hypotheses:  target  present  or  absent. 

We  represent  the  data  as  nodes  in  a  weighted  graph.  Since  we  wish  to  form  a  graph  that  utilizes 
information  inherent  in  the  data  over  time,  so  a  method  of  utilizing  data  from  multiple  time  steps 
was  implemented.  This  method  of  performing  multiframe  analysis  is  done  by  selecting  k  different 
video  frames  and  then  concatenating  the  data  points,  allowing  for  data  to  be  associated  over  these 
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(a)  Background  Frame  (b)  Gas  Frame  1  (c)  Gas  Frame  2  (d)  Gas  Frame  3 


Figure  20:  Fidelity  Region  and  Initialization  for  the  Hyperspectral  Video  Sequence.  First  4  frames  of  the  hyper  spectral 
video:  (top  row)  operator  assisted  ‘ground  truth’  results  from  spectral  clustering  -  used  as  fidelity  points  in  the  MBO 
clustering  algorithm;  (bottom  row)  initialization  for  the  MBO  scheme.  Classification  is  denoted  by  color:  green  = 
mountain;  blue  =  sky;  brown  =  foreground;  orange  =  plume.  White  pixels  denote  unclassified  pixels. 

k  frames.  The  computation  of  the  graph  Laplacian  for  all  of  these  pixels  results  in  a  very  large 
matrix.  As  an  example,  the  Dugway  Proving  Ground  hyperspectral  data  is  of  size  128  x  320 
pixels  with  129  spectral  bands.  For  7  frames,  the  full  graph  Laplacian  is  an  N  x  N  matrix  with 
N  =  286,  720.  Thus,  a  method  for  quickly  computing  eigenfunctions  of  the  graph  Laplacian  is 
desired.  Utilizing  the  Nystrom  method,  we  are  able  to  quickly  compute  an  accurate  approximation 
to  the  eigenfunctions. 

Figure  19  shows  a  sampling  of  four  different  eigenvectors.  Note  that  each  eigenvector  high¬ 
lights  a  different  aspect  of  the  image;  for  example,  the  third  eigenvector  outlines  the  plume.  In 
addition,  the  background  is  maintained  through  the  seven  different  frames.  The  total  run-time  for 
the  Nystrom  extension  with  100  eigenvectors  is  less  than  one  minute  on  a  2.8  GHz  Intel  Core 
2  Duo.  Below  we  use  these  eigenvectors  for  two  parts  of  our  multi-class  clustering  algorithm. 
In  [48],  an  algorithm  is  developed  to  efficiently  solve  the  multiclass  assigment  problem  for  graph- 
based  data  sets.  It  is  semi-supervised  and  thus  needs  “ground  truth”  assigments  for  part  of  the  data 
set.  Since  we  lack  real  ground  truth,  we  perform  operator  assisted  spectral  clustering  to  obtain 
partial  ground  truth.  This  is  performed  by  identifying  the  relevant  eigenfunctions  and  thresholding 
at  some  level  to  identify  a  subset  of  pixels  that  are  highly  likely  to  be  part  of  the  chosen  class.  For 
this  segmentation  we  choose  four  classes:  plume,  sky,  foreground,  and  mountain. 
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(a)  Background  Frame 


(b)  Gas  Frame  1 


(c)  Gas  Frame  2 


(d)  Gas  Frame  3 


Figure  21 :  Results  for  the  Hyperspectral  Video  Sequence.  First  4  frames,  Results  of  the  MBO  classification  algorithm. 
Color  as  specified  in  figure  20. 


Algorithm 


We  denote  the  class  label  vector  as  the  matrix  u.  The  node  i  adopts  a  composition  of  states 
u,  g  MA .  Here  K  =  4  for  four  class  segmentation.  The  kth  component  of  u,  is  the  probability  the 
node  belongs  to  class  k.  For  each  node  i,  we  require  the  vector  u,  to  be  an  element  of  the  Gibbs 
simplex  HK,  defined  as 


£a  :=  <(  (xi,  ...,xK)e  [0, 1] 


K 


K 


%k  =  i 


k= 1 


(18) 


Vertex  k  of  the  simplex  is  given  by  the  unit  vector  ek. 

It  is  shown  in  [48]  that  alternating  between  the  following  two  steps  results  in  an  efficient  classi¬ 
fication  algorithm: 


1 .  Heat  equation  with  forcing  term: 


U”+5 


u 


dt 


=  -Lsun+^  -  /i(un  -  u) 


(19) 


2.  Thresholding: 

<+1  =  ek, 

n+l 

where  ek  is  the  vertex  in  the  simplex  closest  to  the  projection  of  u,  2  onto  the  simplex 
using  [23]. 


Here  /!,  is  a  positive  constant  ft  if  node  z’s  label  is  known  beforehand  (fidelity  point)  and  0 
otherwise,  and  u,  is  a  vector  indicating  prior  class  knowledge  of  sample  i.  Also,  note  that  in  the 
second  step  the  row  vector  Ujn+2  of  step  1  is  projected  back  to  the  simplex  before  any  thresholding 
takes  place.  This  is  done  because  the  result  of  step  1  is  not  necessary  an  element  of  the  Gibbs 
simplex. 


Scheme  (19)  is  solved  using  the  eigenvalue/eigenvector  decomposition  of  the  symmetric  graph 
Laplacian.  The  Laplacian  term  is  treated  implicitly.  The  first  part  of  the  algorithm  can  be  rewritten 
as 

un+5  =  (I  +  df  Ls)-1(u”  —  un  —  u)).  (20) 

We  use  the  eigendecomposition  Ls  =  XAX7  to  write 

1  +  dtLs  =  X  (I  +  dt  A)  XT,  (21) 

but  we  approximate  X  by  a  truncated  matrix  retaining  only  Ne  eigenvectors  (Ne  <C  No),  to  form 
a  matrix  of  dimension  No  x  Ne.  The  term  in  the  parenthesis  in  (21)  is  a  diagonal  Ne  x  Ne 
matrix.  This  allows  one  to  calculate  u"+2  rapidly.  In  particular,  we  write,  for  the  nth  iteration, 
un  =  Xa"  and  fi(u11  —  u))  =  Xd",  where  a  and  d  are  matrices  of  dimension  Ne  by  K,  where 
K  is  the  number  of  classes.  Denote  E  to  be  the  diagonal  matrix  containing  the  eigenvalues  of  the 
symmetric  graph  Laplacian,  then  Lsun  =  XEa".  Also  denote  by  a/,  and  d /,.  the  kth  row  of  a  and 
d,  respectively.  Plugging  all  the  known  expressions  into  (19),  we  obtain  an  equation  for  a(! ‘  1  that 
effectively  replaces  (19): 


ak  ~ 


a£  -  dtdl 
1  +  dt  Afc 


where  A/c  is  the  kth  eigenvalue  of  the  symmetric  graph  Laplacian.  The  remaining  step  is  simple 
thresholding. 

We  tested  our  method  on  seven  video  frames,  using  four  classes,  to  segment  the  plume,  sky, 
foreground  and  mountain.  The  fidelity  region  is  shown  in  the  first  row  of  figure  20.  The  initializa¬ 
tion  for  the  MBO  scheme  is  displayed  in  the  second  row.  The  final  segmentation  results,  after  17 
iterations,  are  shown  in  figure  21,  using  first  four  frames. 

The  fidelity  region  is  calculated  differently  from  [48],  where  the  fidelity  points  were  chosen 
randomly  from  known  ground  truth  data.  Without  the  ground  truth,  we  use  an  operator  assisted 
method  involving  spectral  clustering.  In  particular,  by  thresholding  appropriately  the  values  of 
eigenvectors,  one  obtains  information  about  a  particular  class.  For  example,  as  shown  in  Figure 
19,  the  third  eigenvector  provides  information  about  the  plume.  By  thresholding  its  values,  one 
can  find  the  pixels  that  are  most  likely  part  of  the  plume.  We  used  the  fifth  eigenvector  to  obtain 
fidelity  for  the  mountain,  and  the  second  one  for  both  the  sky  and  the  foreground.  This  process 
resulted  in  36%  points  of  the  overall  points  from  all  the  frames  identified  as  as  good  fidelity  points. 
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For  the  MBO  scheme,  we  start  with  an  initialization  of  randomly  chosen  phase  classes  for 
non-fidelity  points  and  the  “ground  truth”  value  for  the  fidelity  points.  The  MBO  iteration  was 
performed  17  times  using  a  stopping  criterion  for  convergence,  with  dt  —  0.1  and  \x  =  100.  We 
compared  results  with  10  to  100  eigenvectors;  they  deteriorated  with  less  than  10  eigenvectors  but 
were  similar  with  more  than  10  eigenvectors.  In  all  tests  we  used  the  same  operator  assisted  fidelity 
points.  The  MBO  iteration  took  around  11  seconds  (for  100  eigenvectors)  on  a  2.4  GHz  Intel  Core 
i2  Quad,  after  obtaining  the  eigenvectors  from  the  Nystrom  scheme. 

We  thus  presented  an  application  of  a  recent  multiclass  classification  algorithm  [48]  to  hyper- 
spectral  video  data.  We  use  the  Nystrom  extension  method  to  efficiently  calculate  the  needed 
eigenvectors.  This  implementation  of  the  algorithm  requires  an  operator  assisted  spectral  cluster¬ 
ing  preprocessing  step  to  identify  a  subset  of  pixels  denoted  as  “ground  truth”  for  the  four  classes. 
The  resulting  classification  of  chemical  plumes  and  background  pixels  are  excellent.  Only  a  small 
number  of  eigenvectors,  ten  in  particular,  is  needed  to  achieve  a  good  result  and  no  preprocessing 
is  necessary.  The  entire  process  took  about  a  minute  on  desktop  PCs. 
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Table  3:  Multiclass  MBO  and  GL  Method  Classification  Results 


MNIST 


Two/Three  moons 


Method 

Accuracy 

spectral  clustering  [47] 

80% 

p-Laplacian  [18] 

94% 

Cheegercuts  [100] 

95.4% 

tree  GL  [47] 

97.4% 

binary  GL  [9] 

97.7% 

multiclass  GL 

98.4% 

multiclass  MBO 

99.12% 

Method 

Accuracy 

p-Laplacian  [18] 

87.1% 

multicut  normalized  1  -cut  [64] 

87.64% 

linear  classifiers  [71,72] 

88% 

Cheeger  cuts  [100] 

88.2% 

boosted  stumps  [68,72] 

92.3-98.74% 

transductive  classification  [101] 

92.6% 

tree  GL  [47] 

93.0% 

/c-nearest  neighbors  [71,72] 

95.0-97.17% 

neural/convolutional  nets  [25,71,72] 

95.3-99.65% 

nonlinear  classifiers  [71,72] 

96.4-96.7% 

multiclass  GL 

96.8% 

multiclass  MBO 

96.91% 

SVM  [29,71] 

98.6-99.32% 

COIL  WebKB 


Method 

Accuracy 

fc-nearest  neighbors  [98] 

83.5% 

LapRLS  [6,98] 

87.8% 

sGT  [67,98] 

89.9% 

SQ-Loss-I  [98] 

90.9% 

MP  [98] 

91.1% 

multiclass  GL 

91.4% 

multiclass  MBO 

91.46% 

Method 

Accuracy 

vector  method  [20] 

64.47% 

fc-nearest  neighbors  ( k  =  10)  [20] 

72.56% 

centroid  (normalized  sum)  [20] 

82.66% 

naive  Bayes  [20] 

83.52% 

SVM  (linear  kernel)  [20] 

85.82% 

multiclass  GL 

87.3% 

multiclass  MBO 

88.48% 

Other  Data  Sets 


Method 

Accuracy 

Accuracy 

multiclass  MBO 

multiclass  GL 

swissroll  data  set 

91.8% 

94.1% 

Landsat  satellite  data  set 

87.25% 

87.05% 

human  activity  data  set 

89.7% 

88.7% 
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Table  4:  MBO  Method  WebKB  Results 


Method 

10% 

15% 

20% 

25% 

30% 

WebKB  results  for  Multiclass  GL  (%  correct) 

81.5% 

84.2% 

85.4% 

86.7% 

87.3% 

WebKB  results  for  Multiclass  MBO  (%  correct) 

83.71% 

85.75% 

86.81% 

87.74% 

88.48% 

Table  5:  Runtime  (in  Seconds)  of  the  MBO  Method 


Data  set 

three  moons 

grayscale  image 

color  image 

MNIST 

COIL 

WebKB 

Graph  Calculation 

0.771 

19.96 

645.34 

6183.1 

0.95 

399.35 

Eigenvector  Calculation 

0.331 

210.10 

190.93 

1683.5 

0.19 

64.78 

Multiclass  GL 

0.163 

4.08 

19.92 

811.5 

2.31 

6.97 

Multiclass  MBO 

0.013 

0.23 

1.20 

15.4 

0.03 

0.05 

Table  6:  Number  of  Iterations  of  the  MBO  Method 


Data  set 

three  moons 

grayscale  image 

color  image 

MNIST 

COIL 

WebKB 

Multiclass  GL 

140 

90 

200 

460 

700 

275 

Multiclass  MBO 

3 

2 

11 

7 

6 

7 

CHAPTER  4 


Convex  Method 


In  this  section,  several  versions  of  a  convex  binary  classification  method  are  developed.  We  are 
interested  in  solving  partition  problems  of  the  form 

5  c  V'  E  w(x,y),  (22) 

( x,y)eE  :  x£S,  y£V\S 

which  is  the  formulation  of  the  minimum  cut  problem,  under  supervised  constraints 

S  D  Vf,  V\S  D  vb  (23) 


and  an  optional  volume  constraint 


\S\  =  a\V\t 


(24) 


where  0  <  a  <  1.  V?  C  V  is  a  set  of  nodes  that  are  known  a  priori  to  belong  to  the  region  S  and 
Vb  C  V  is  a  set  of  nodes  that  are  known  to  belong  to  region  V\S.  Variations  of  this  problem  have 
been  vastly  explored  in  literature.  For  example,  in  [14],  the  authors  describe  an  algorithm  which 
minimizes  a  normalized  version  of  the  cut,  specifically  the  balanced  cut.  A  multiclass  version  of 
this  method  is  introduced  in  [13]. 

By  defining  a  binary  function 


u(x)  = 


1,  x  G  S 
0,  X  e  V\S, 

the  supervised  minimum  cut  problem  can  be  regarded  as 


where 


as  defined  earlier  (with  q 


min  Ep(u)  =  TVw(u )  +  f(u(x),  x), 

u&B  L — ' 

x&V 


Y  _ 

TVw(u)  =  ~^2w(x,y)\u(x)  —  u(y)\ 

x.y 

1)  and 


(25) 


B  =  {u:V  ^  {0,1}} 


(26) 
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is  the  set  of  binary  functions  indicating  the  partition.  Here,  f(u(x),x)  is  a  fidelity  term  which 
incorporates  the  supervised  constraints  (23).  It  typically  takes  the  form  of 

f(u(x),x )  —  r](x)\u(x)  —  u°(x)  |2,  (27) 

where  u°  is  a  binary  function  taking  value  1  in  V-1  and  0  in  Vb,  and  r)(x)  is  a  function  that  takes  on 
a  large  constant  value  rj  on  fidelity  points  V1  U  Vb  and  zero  elsewhere.  If  //  is  chosen  sufficiently 
large,  it  can  be  guaranteed  that  the  solution  u  satisfies  the  supervised  constraints.  In  [75],  the 
authors  propose  solving  a  similar  minimization  problem  by  introducing  nonlocal  global  minimizers 
of  active  contour  models  on  graphs. 

In  addition,  when  the  size  of  the  two  classes  is  known,  the  volume  of  the  regions  may  be 
enforced  to  satisfy  a  constraint  of  the  form 

5>(x)  =  a|y|,  (28) 

xev 

where  a  is  the  fraction  of  the  nodes  belonging  to  class  2,  e.g.  a  =  \  enforces  partitions  of  equal 
volume.  The  goal  of  this  is  to  create  an  algorithm  that  requires  a  much  smaller  fidelity  set  (to 
produce  an  accurate  classification)  than  otherwise,  because  we  also  have  the  information  about 
class  size. 

In  previous  work  [9],  the  problem  (25)  was  formulated  as  the  minimization  of  a  Ginzburg- 
Landau  (GL)  functional  on  graphs  with  a  fidelity  term 

GLe(u)  =  || Wu\\2e  +  ^  W(u{x))  +  f{u(x),x),  (29) 

X 

where 

l|Vu|||  =  \^2w{x^y){u{x)  -  u{y)f 

x,y 

as  defined  before.  Note  that  as  e  — »  0,  in  the  limit  of  gamma  convergence,  the  sum  of  first  two 
terms  of  the  energy  converge  to  the  total  variation  term,  making  the  energy  exactly  the  same  as 
one  in  (25).  The  problem  is  solved  using  gradient  descent  and  an  efficient  convex  splitting  scheme. 
This  method  will  be  referred  to  as  “binary  GL”  in  the  paper,  and  we  compare  it  to  our  work. 

In  [80],  (29)  is  solved  numerically  by  a  variation  of  the  MBO  scheme  [81],  a  method  to  approx¬ 
imate  motion  by  mean  curvature.  To  make  everything  consistent  with  the  notation  and  theorems 
stated  in  the  paper,  we  include  an  extra  scaling  in  our  implementation  of  the  method  in  [80],  and 
the  justification  is  described  shortly.  We  note  that  this  change  in  the  method  did  not  exacerbate  the 
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results  as  compared  to  those  of  the  original  method;  in  fact,  it  produced  very  little  change  in  any 
simulation.  This  algorithm  will  be  referred  to  as  “binary  MBO”  in  the  paper,  and  we  compare  it  to 
our  new  algorithms.  The  discretized  version  of  the  algorithm  is  the  following: 

Starting  with  some  initial  classification  u  E  {0, 1},  alternate  between  the  following  two  steps 
until  the  stopping  criterion  is  satisfied: 


1.  Heat  equation  with  forcing  term: 

un+1*  -un  _  +1 _ 1  d f(u(x),x) 

dt  w  d(x)r  du 


(30) 


2.  Thresholding: 


{1,  if  un+2  (x)  >  0.5, 
0,  if  wn+2  (x)  <  0.5. 


Here,  after  the  second  step,  un+1(x)  can  take  only  two  values  of  1  or  0;  thus,  this  method  is 
appropriate  for  binary  segmentation. 

Following  [80],  (30)  is  solved  by  a  semi-implicit  scheme,  where  the  Laplacian  term  is  calculated 
implicitly,  and  the  terms  are  considered  as  a  linear  combination  of  the  eigenvectors  of  the  random 
walk  Laplacian. 

To  show  the  general  idea  of  the  derivation,  we  start  with  the  Ginzburg-Landau  (GL)  functional 
on  graphs  (29).  One  can  rewrite  it  using  inner  product  notation: 

GLe(u)  =  ||Vi4+  -e(D~rW(u(x)),l)v  +  (D-rf(u(x),x),l)v, 

where  ( D~rW(u))(x )  =  d{x)~rW (u(x)) .  The  factor  d{x)~r  is  needed  to  cancel  the  factor  d(x)r 
in  the  V-  inner  product. 

The  modified  Allen-Cahn  equation  can  then  be  derived  using  the  V-gradient  flow  associated 
with  GLe.  We  have 


^-GLe(u  +  tj)\t=0  =  —2(Awu,  7)v  +  ~(D  rW'{u{x))^)v  +  (D  r^f(u(x),x),  j)v. 
dt  e  ou 


The  modified  Allen-Cahn  equation  is  then 

1 


u(x)  =  2A  wu  — 


1  df 

W'(u{x ))  -  -77 —  —  (Hx),x) 


ed(x)r  d(x)r  du 

The  above  equation  can  be  solved  using  a  time-splitting  scheme,  where  the  splitting  occurs  so 

that  the  double-well  potential  term  is  separated.  The  first  step  is 

1  df 

u(x)  =  2A„u  -  ——(u(x),x) 
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and  the  second  step  (with  the  double-well  potential)  is  just  thresholding  in  the  e  — >•  0  limit.  By 
alternating  between  these  two  steps,  one  can  form  an  approximate  solution  of  the  modified  Allen- 
Cahn  equation  (4). 

Such  approaches  (binary  GL  and  binary  MBO  methods)  converge  to  the  nearest  local  minimizer 
from  a  given  initialization.  In  general,  one  cannot  guarantee  that  the  desired  global  minimizer  is 
obtained.  The  subject  of  this  work  is  to  develop  a  convex  optimization  framework  for  minimizing 
(25)  which  is  guaranteed  to  obtain  the  global  minimizer.  Various  algorithms  are  developed  for 
solving  the  convex  problems. 

The  problem  (25)  is  non-convex  because  the  binary  side  constraints  (26)  are  non-convex. 
We  show  that  the  binary  constraints  can  be  replaced  by  their  convex  hull  [0, 1]  to  obtain  an  exact 
convex  formulation  as  was  shown  in  [21]  for  images.  Define  first  the  functions 

Cs(x)  =  f( 0,x),  Ct(x)  =  f( l,x),  \/x  €  V, 

g((f>(x),x)  =  Ct(x)(l)(x)  +  Ca(x)(l-<j)(x)),  VxeV.  (31) 

The  problem 

min  E(u)  —  TVw(u)  +  g(u(x),x)  (32) 

uGB  ' 

xev 

is  equivalent  to  the  formulation  (25).  The  proof  is  obvious  as  g(f(x),x)  =  f(<f>(x),x)  for  all 
binary  f. 

The  convex  relaxed  problem  is  formulated  as  follows: 

(33) 

where 

B'  =  {u  :  V  [0,1]}.  (34) 

This  is  the  problem  we  will  be  concerned  with  in  this  section  about  the  convex  method. 

The  following  theorem,  which  is  a  generalization  of  theorem  1  in  [21]  from  images  to  graphs, 
shows  that  the  minimizer  of  the  convex  problem  can  be  thresholded  to  yield  a  binary  global  mini¬ 
mizer  of  the  original  problem. 

Theorem  2.  Let  u*  be  a  minimizer  of  (33).  Denote  by  ul  :  V  (->•  {0, 1}  the  binary  function 

1,  ifu*(x)  >  £. 

0,  ifu*(x)  <  L 
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Then  for  almost  every  l  G  (0, 1],  if  is  a  global  minimizer  of  the  non-convex  problem  (32). 


Proof:  For  any  function  u  G  B'  and  for  any  x  G  V,  if  u(x)  G  [0, 1],  then  J0'  if  (x)  dt  =  ifx). 
Therefore,  for  each  x  G  V, 

f  g(vf(x),  x)  dt  —  f  ue(x)Ct(x)+(l—ue(x))Cs(x)d£  —  u(x)Ct(x)+(l—u(x))Cs(x)—g(u(x),x). 
Jo  Jo 

By  the  coarea  formula,  we  have  that 


/  TVw(ue)de  =  TVw(u). 

Jo 

This  can  be  easily  shown  using  the  definition  of  TVW.  In  particular,  we  see  that 

f  TVw(vf)dl  =  f  \  ^  w(x,  y )  | ue(x)  -  u\y)\dx  = 

J°  J°  x,y 

^  ^  r  i  ^  ^ 

~^2w(x,y)  /  \u\x)  -u\y)\dx  =  ~^w(x,y)\u(x)  -  u(y)\  =  TVw(u). 

x,y  ®  x,y 

The  above  can  be  seen  using  the  fact  that  (assuming  that  u(x)  <  u(y)  without  loss  of  generality) 


u*(y) |  =  < 


iff  <  x. 

if  u(x)  <  i  <  u(y). 

if  £  >  y. 


so  that  Jq1  ue(x)  —  ue(y)\dx  =  | u(x)  —  u(y) |. 

For  a  proof  of  the  coarea  formula  in  the  continuous  case,  see  Appendix  B  of  [104]. 

Combining  the  above  properties,  we  obtain  that  for  any  u  G  B\ 

[  Ep(ue)d?  =  Y^  [  {TVw(ue)  +  g(ue(x),x))de=J2TVw(,u)+g(u(x),x)  =  Ep(u). 

For  a  u  that  minimizes  the  energy,  clearly  Ep(u )  <  Ep(ul)  for  any  l  G  (0, 1].  However,  equality 
(4)  can  then  only  be  true  provided  Ep(u )  =  Ep{ul )  for  almost  every  i  G  (0, 1].  In  other  words,  if 
also  minimizes  the  energy  for  almost  every  i  G  (0, 1].  ■ 

In  order  to  solve  (33),  we  consider  two  main  algorithms.  The  first  is  based  on  solving  an 
equivalent  formulation  of  the  problem,  which  can  be  identified  as  a  maximum-flow  problem,  by 
convex  optimization  techniques.  It  will  be  referred  to  as  the  “max-flow”  method  in  this  paper. 
We  present  three  versions  of  this  algorithm:  one  without  hard  supervised  constraints  (Convex 
Method:  Version  1),  one  with  hard  supervised  constraints  (Convex  Method:  Version  Is),  and  one 


69 


with  balancing  constraints  (Convex  Method:  Version  lb).  The  second  algorithm  (Convex  Method: 
Version  2)  solves  the  original  problem  by  the  augmented  Lagrangian  technique,  and  will  be  denoted 
the  “primal  augmented  Lagrangian”  method  in  this  section. 

4.1  Convex  Method  (Versions  1  and  Is):  Max-flow  Without  Balancing  Con¬ 
straints 

In  this  subsection,  we  introduce  the  first  version  of  solving  (33),  which  consists  of  formulating 
the  problem  in  an  equivalent  max-flow  setting  using  a  graphical  framework. 

We  have  discussed  earlier  the  general  graph  framework  where  one  considers  a  set  of  vertices  V 
and  the  set  of  edges  E  between  them.  We  now  extend  it  further  by  adding  two  “special”  vertices, 
the  source  and  the  sink,  to  which  the  regular  vertices  are  connected.  Let  us  now  view  this  problem 
in  a  more  physical  setting  by  considering  a  flow  that  passes  from  the  source  to  the  sink  through 
the  vertices  in  between.  Let  the  weight  defined  on  each  edge  represent  the  upper  bound  on  the 
amount  of  flow  that  is  allowed  to  pass  through  the  edge.  Of  course,  that  capacity  does  not  need  to 
be  reached,  this  is  just  the  upper  bound. 

The  max-flow  problem  [43]  aims  to  maximize  the  flow  coming  from  the  source  node  s  to  a  sink 
node  t.  We  let  ps,Pt  ■  V  (->•  M  denote  the  amount  of  flow  on  the  edges  that  connect  a  source  or  sink 
edge,  respectively,  to  a  regular  vertex  of  the  graph.  For  example,  ps(x)  denotes  how  much  flow  is 
passing  from  the  source  to  vertex  x.  Analogously,  pt(x)  denotes  how  much  flow  is  passing  from 
vertex  x  to  the  sink.  Now  let  p  :  E  i->-  M  represent  the  amount  of  flow  on  edges  between  pairwise 
points  in  V,  where  E  C  V  x  V.  For  example,  p(x ,  y)  denotes  the  amount  of  flow  passing  from 
vertex  x  to  vertex  y.  The  upper  capacities  on  the  source  edges  are  denoted  by  Cs  and  on  sink  edges 
by  Ct,  and  there  is  no  lower  bound  on  the  capacities.  Therefore, 

ps(x)  <  Cs(x)  \/x  (35) 

and 

pt{x)  <  Cs(x)  Vx.  (36) 

The  flows  p(x,y )  on  the  edges  (x,y)  are  bounded  by  \p(x,y)\  <  C.  In  this  section,  we  choose 
C  to  be  1.  However,  one  does  not  have  to  use  a  constant  bound  here;  one  can  consider  a  general 
function  C(x,  y).  The  amount  of  flow  in  the  graph  can  be  expressed  as  the  amount  of  flow  on  the 
source  edges,  which  we  want  to  maximize  under  flow  capacity  and  flow  conservation  constraints. 
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In  this  section,  we  describe  two  max-flow  problems.  The  first  is  equivalent  to  the  problem 
(25)  with  fidelity  term,  and  consequently  solves  the  original  problem  (22)  provided  the  penality 
parameter  rj  is  high  enough.  The  second  max-flow  problem  incorporates  the  supervised  constraints 
directly  without  the  need  for  a  very  large  penalty  term.  The  following  derivations  extend  the 
continuous  max-flow  problem  [108,  109]  from  images  to  general  graphs.  All  the  operators  are 
written  in  graph  form. 

4.1.1  Max-flow  Formulation  with  Supervised  Constraints  as  Fidelity  Term 

In  this  subsection,  we  show  that  the  following  primal  formulation  (that  can  be  interpreted  as  a 
max-flow  problem  over  the  graph)  is  equivalent  to  the  convex  partition  problem  (33): 


subject  to 


max  \p(ps,pt,p)  =  V 

Ps,Pt,P  t 

x&V 

Ps(X)} 

(37) 

\p(x,y)\  <  C, 

(x,y)  e  E ; 

(38) 

PS(X )  <  C8(X), 

Mx  e  V; 

(39) 

Pt(x)  <  ct(x), 

Vx  E  V; 

(40) 

di  vwp(x) -ps(x) +pt(x)  =  0, 

VxeV. 

(41) 

where  (38)  is  the  flow  capacity  constraint  on  edges  between  paiwise  nodes,  (39)  and  (40)  are  flow 
capacities  on  the  source  and  sink  edges,  and  (41)  is  the  flow  conservation  constraint.  Note  that 
we  use  the  divergence  operator  to  evaluate  the  net  flow  locally  around  vertex  x.  The  method  of 
solving  the  formulation  is  outlined  as  “Convex  Method:  Version  1”,  but  will  also  be  referred  to  as 
the  max-flow  algorithm. 

The  objective  function  (37)  measures  the  total  amount  of  flow  on  the  graph.  Due  to  con¬ 
straint  (39),  the  maximization  problem  (37)  is  bounded  above  by  fQ  Cs(x),  which  is  finite  provided 
f((j)(x),x)  is  bounded  (true  for  the  data  terms  considered  in  this  work). 

It  is  well  known  that  the  value  of  the  maximum  flow  of  the  maximum  flow  problem  is  equivalent 
to  the  value  of  the  minimum  cut  (where  one  sums  the  weights  on  the  edges  between  vertices  of 
different  classes).  In  classical  max-flow  min-cut  theory,  to  obtain  the  final  classification  by  solving 
the  maximum-flow  problem,  one  can  use  the  information  of  the  flow  on  the  source  and  sink  edges. 
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If  for  x  G  V,  there  is  a  non-saturated  path  between  s  and  x,  then  x  is  in  class  1.  If  there  is  a 
non-saturated  path  between  x  and  t,  then  x  is  in  class  2. 

In  this  paper,  we  instead  solve  the  max-flow  problem  (37)  by  continuous  optimization.  Intro¬ 
ducing  such  a  Lagrange  multiplier  (see  background  section  on  optimizing  function)  for  the  flow 
conservation  constraint  (41)  leads  to  the  the  primal  —  dual  formulation  of  (37): 


subject  to 


min  max  \  E(psiptlp]  A)  =  ps(x) 

A  ps,pt,p  t  ^ ' 

x£V 

+  ^  Kx)  ( div^  p  -  Ps  +  pt)  } 
xev 

(42) 

\p(x,y)\  <  C, 

(x,y)  e  E- 

(43) 

Ps(x)  <  cs(x), 

Vx  e  V; 

(44) 

Pt(x )  <  ct(x), 

Vx  e  V. 

(45) 

Rearranging  the  terms,  we  obtain 

min  max  j  E(ps ,  pt ,  p;  A)  =  E  { (l  -  A )ps  +  Xpt  +  A  div^p}  |  (46) 

^  ^  ^  x£V 

subject  to  (43),  (44)  and  (45). 

By  maximizing  the  above  problem  for  ps,pt  and  p,  in  the  continuous  (or  in  the  graph)  case,  we 
obtain  the  closed  form  expression  (33)  subject  to  the  constraint  (34).  If  the  constraint  (34)  was  not 
met,  the  energy  could  become  arbitrarily  large  by  chosing  ps  or  pt  arbitrarily  low,  contradicting 
boundedness  from  above. 

In  order  to  optimize  the  problem  (46),  let  us  first  consider  a  continuous  setting  and  the  problem 


f(q )  =  maxp  '  <?•  (47) 

p<C 

When  q  <  0,  we  can  choose  p  to  be  negative  infinity  to  maximize  p  ■  q,  which  gives  f(q)  =  +oo. 
Also,  if  q  =  0,  then  p  <  C  and  f(q)  reaches  maximum  at  0.  If  q  >  0,  then  p  =  C  and  f(q)  reaches 
maximum  at  q  ■  C.  Therefore, 


Define 


f(q) 


{oo  if  q  <  0. 
q  •  C  otherwise. 


fs(x)  =  max  (1  -  A(x))  -p8{x), 

Ps{X)<Cs{x) 


(48) 


(49) 
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(50) 


Then  we  have  the  following: 

fs(x)  = 

and 


ft{x)  =  max  \(x)-pt(x). 

pt(x)<Ct{x) 


(1  —  A(x))  •  Cs(x)  if(l  —  A(x))  >  0, 

oo,  otherwise 


(51) 


ft(x)  = 


X(x)  ■  Cs(x)  ifA(x)  >  0, 
oo,  otherwise 


(52) 


By  [53],  it  is  known  that 


max  /  Adi  vpdx  =  /  K I V  A I  dx , 

P(x)\<cjn  Jn 


(53) 


where  K  is  a  constant.  Here,  we  write  p(x)  because  this  is  a  continuous  setting,  defined  on  a  plane. 
We  can  also  derive  an  equivalent  of  this  equation  in  a  graph  setting.  We  have 

(A,  d\vwp)v  =  Y  \x(x)(Yw(x’y^p(x’y}  ~P(yix))}  = 

x  y 

_  j  _  J  _  ^ 

YX (x)nw(xiy)p(xiy)  -  Y  9 Kx)w(x,y)p(y,x)  =  Y  t(A(x)  “  A (y))w(x,y)p(x,y). 


x,y 


X,y 


x,y 


Since 


_  J  J  _ 

max  V  o(A(x)  -  Hy))w(x,y)p(x,y)  =  -5^w(x,?/)|A(x)  -  A(y)|, 


p(x,y)<C  2 
x,y 


x.y 


we  see  that 


max  (A,  divM,  p) y  =  K  ■  TVW(X),  (54) 

p(oc,y)<C 

where  K  is  a  constant. 

By  (51), (52)  and  (54),  in  a  continuous  setting,  maximization  of  (46)  over  flows  ps,  pt  and  p 
leads  to  the  equivalent  dual  model  (33)  subject  to  the  constraint  (34).  We  thus  have  the  following 
equivalent  formulations: 

Primal  Formulation 

max  \P(ps,Pt,p)  = 

Ps,Pt,P  l  Z— ^  J 

x£V 

subject  to  (38)-  (41). 

Primal-Dual  Formulation 

min  max  \  E(ps,pt,p;  A)  =  VV(z)  +  A(x)(di vwp-ps  +  pt)  \ 

A  ps,pt,p  t  > 

x£V  xGV 
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subject  to  (38)-  (40). 

Dual  Formulation 

min  E(u)  =  TVW(  A)  +  g(X(x),x). 

{A:y^[0,l]}  ^ 

xev 

Existence  of  dual  and  primal-dual  solutions  follows  by  the  minimax  theorem.  Prop.  2.4  of  [35] 
Chapter  VI.  In  more  detail,  the  constraints  of  the  flows  are  convex,  and  the  function  is  linear  in 
all  the  variables,  and  thus  convex  l.s.c.  for  a  fixed  A  and  concave  u.s.c.  for  fixed  ps,pt  and  p. 
Therefore,  we  have  the  existence  of  at  least  one  saddle  point.  Also,  the  min  and  max  operators  of 
the  primal  model  can  be  exchanged: 

min  ma x{E(ps,pt,p-,  A)  =  max  min {E(ps,pt,p;  A)}.  (55) 

A  ps,pt,p  Ps,pt,p  A 

When  we  maximize  the  primal-dual  problem  over  A,  we  obtain  the  primal  model: 

P(Ps,Pt,P )  =  min {E(ps,pt,p;  A)}.  (56) 

A 

Note:  Connection  between  variables 

Solving  the  primal  problem,  how  can  one  obtain  A  or  the  solution  to  the  min-cut  problem?  Let 
(A *(x),p*s,pl,p)  be  the  optimal  primal-dual  pair  of  (46).  Now  observe  that  if  ps(x)  <  Cs(x )  (flow 
is  unsaturated),  we  have  1  —  A*  (a;)  =  0,  i.e. 

P*(x)  <  Cs(x)  =>  A* (a;)  =  1.  (57) 

Therefore,  we  class  vertex  x  as  being  of  class  2.  Morever,  fs(x)  =  (1  —  A *(x))p*(x)  =  0,  so  that 
Pt(x)  =  C't(x).  Thus,  that  flow  is  saturated. 

Similarly,  in  the  case  that  p*(x)  <  Ct(x)  (flow  is  unsatured),  we  have  A*  (a;)  =  0,  i.e. 

p*t(x)  <  Ct(x)  =>  A*(a;)  =  0.  (58) 

Therefore,  we  class  vertex  x  as  being  of  class  1.  Morever,  ft(x)  =  X *(x)p*(x)  =  0,  so  that 
p*s(x)  =  Cs(x).  Thus,  the  flow  from  the  source  is  saturated. 

One  can  also  obtain  obtain  the  solution  to  the  min-cut  problem  using  the  primal-dual  problem 
by  the  Lagrange  multiplier  or  A. 

Therefore,  the  remarks  above  show  an  elegant  relationship  between  A  and  the  three  variables  of 
the  primal  problem. 
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4.1.2  Max-flow  Formulation  with  Hard  Supervised  Constraints 


We  also  describe  another  formulation  of  the  problem,  which  avoids  using  a  fidelity  term  that 
is  forced  to  take  on  a  very  large  value  of  r j  to  enforce  that  the  supervised  constraints  are  satisfied. 
Define  first  the  binary  functions 

l,  x  e  Vf 

0,  otherwise 

and  consider  the  following  modification  of  the  max-flow  problem  (37) 


v  x  = 


o,  x  e  Vb 

1,  otherwise 


subject  to 


max 

Ps  ,Pt,P 


{  P(Ps,Pt,P)  =  l,bPs 


xev 


(59) 


\p(x,y)\  <  C,  ( x,y)eE ; 

ps(x)  <0,  Vx  €  V ; 

Pt(x)  <0,  \/x  €  V ; 

divu,p(a;)  —  ps(x)  +  pt(x)  =  0,  V  x  G  V. 


(60) 

(61) 

(62) 

(63) 


Introducing  the  Lagrange  multiplier  A  for  constraint  (63),  we  obtain  the  following  Lagrangian 
formulation  after  rearrangement  of  the  terms: 


mm  max 

A  pa  ,pt  ,P 


Pt,  p\  A)  =  { ( Vb  -  A )ps  +  (A  -  vf)pt  +  A  divu,  p) 


xev 


(64) 


As  there  are  no  lower  bounds  on  ps  and  pt,  it  can  be  observed  that  optimal  solutions  A  must  satisfy 
the  constraints 

v*  <  A  <  vb,  (65) 


otherwise  the  energy  could  be  made  arbitrarily  large.  Note  that,  by  maximizing  for  the  flows  ps,  pt 
and  p  in  continuous  space,  we  therefore  obtain  the  primal  problem 


min  TVW(X)  (66) 

a  eB' 

subject  to  (65).  See  previous  section  for  the  ideas  of  the  derivation. 

Also,  if  A*  is  a  solution  to  (66)  subject  to  supervised  constraints,  then  A*  is  obviously  also  a 
solution  to  (33)  provided  the  penalty  parameter  q  in  the  fidelity  term  of  (33)  is  chosen  sufficiently 
high. 

The  method  of  solving  the  formulation  is  outlined  as  “Convex  Method  Version  Is”. 
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Algorithms 


The  dual  problems  (37)  or  (59)  are  solved  by  the  augmented  Lagrangian  method  as  in  [108, 109]. 
To  solve  (37),  construct  first  the  augmented  Lagrangian  function  corresponding  to  (37): 

Lc{ps,Pt,P,  A)  =  iPs  +  ~  Ps  +  Pt)}  ~  7:  \\divwp  -  ps  +  pt\\l  , 

xev 

where  ||s||2  =  X*ey  ls(^)ll  - 

The  goal  is  to  solve 

max  <P(ps,pt,P )  =  y^ps(x)i 

Ps,Pt,P  l  Z '  ) 

xEV 

under  constraints  (38)  -  (41)  by  solving 

min  max  Lc{ps,pt,p,  X). 

A  ps,pt,p,  A 

An  augmented  Lagrangian  method  can  be  applied  to  solve  this  problem  by  alternatively  maximiz¬ 
ing  Lc  for  the  dual  variables  ps,pt  and  p  with  constraints  (43)-(45)  and  updating  the  Lagrange 
multiplier  A  as  follows: 


Convex  Method  (Version  1)  Max-flow  Algorithm 


Initialize p\ ,  p\ ,  p1  and  A1.  For  k  =  1, ...  until  convergence: 
•  Optimize  p  flow 


Q 

pk+1  =  arg  max  —  -  ||divTOp  —  F 

\p(e)\<CVe£E  2 


fel  2 
2  ’ 


where  Fk  =  psk  —  ptk  4-  —  is  fixed. 
Optimize  source  flow  ps 


9  WPa  “  G 

Ps{x)<Cs(x)\/X  ev  xgy  ^ 


U  11 

ps  =  arg  max 


k  II2 

2 


where  Gk  =  ptk  +  divu,  pk 

Optimize  sink  flow  pt 


—  is  fixed. 

C 


Pt+1  —  arg  max  —  -  \\pt  ~  H 


Pt(x)<Ct(x)  VzGV 


fc||2 
2  ’ 


(67) 


(68) 


(69) 


where  Hk  =  psk+1  —  div^pfc+1  +  —  is  fixed. 


fc+l  ,  Afc 
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Update  A 


A k+1  =  A k  -c-  (div^  pk+1  -  pks+1  +  pk+l) . 


To  solve  (59),  construct  the  augmented  Lagrangian  function: 

Lc(ps,Pt,P,  A)  =  ^2  i vbPg  ~vfPt  +  A(di ywp-Ps  +  Pt)}  ~  7,  ||divwp-ps  +  Pt\\l  ■ 

x£V 

The  goal  is  to  solve 

max  )  (■ vbps  —  v-'pt ) 

Ps,Pt,P 

xGV 

under  constraints  (38)  -  (41)  by  solving 

min  max  Lc(ps,pt,p,  A). 

A  pa,pt,p,  A 

An  augmented  Lagrangian  method  can  be  applied  to  solve  this  problem  by  alternatively  maxi¬ 
mizing  Lc  for  the  dual  variables  ps ,  pt  and  p  with  constraints  (60)-(62)  and  updating  the  Lagrange 
multiplier  A.  The  augmented  Lagrangian  method  for  (59)  thus  becomes: 


Convex  Method  (Version  Is)  Supervised  Max-flow  Algorithm 


Initialize p\ ,  p\ ,  p1  and  A1.  For  k  =  1, ...  until  convergence: 

•  Optimize  p  flow 

pk+1  =  arg  max  —  -  1 1  divM,  p  —  F 

\p(e)\<C  Ve£E  2 

where  Fk  =  pk  —  pk  +  —  is  fixed. 

•  Optimize  source  flow  ps 


k  II2 
2  ’ 


h_|_1  L 

ps  =  arg  max  y.  v  Ps 

Ps(x)<0  Vx£V  xeV 


i  Up.  -  c 


fc|  2 
2  ’ 


where  Gk  =  pk  +  divu,  pk+1  —  —  is  fixed. 
Optimize  sink  flow  pt 


pk+1  =  arg  max  —  V''  v^pt - ||  pt  —  H 

pt(x)<  ovxgv  xeV  2 


fcl  2 
2 


(70) 


(71) 


(72) 


where  Hk  =  psk+1  —  div^pfc+1  +  —  is  fixed. 


fc+l  ,  Afc 
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Update  A 


Afc+i  =  _  c  (diwfc+1  -  pks+l  +  pk+l) . 


Due  to  the  relation  between  problem  (64)  and  problem  (33),  the  output  A  at  convergence  will 
be  a  solution  to  (33).  Similarly,  if  r/  is  chosen  sufficiently  high  in  (33),  then  solution  A  to  (46)  will 
also  be  a  solution  to  (33). 

By  Theorem  1,  one  can  obtain  a  partition  which  solves  (32)  by  the  thresholding  procedure 
described  in  (2). 

The  subproblems  (67)  and  (70)  for  updating  p  are  solved  by  one  step  of  the  following  procedure: 

Pk+1  =  Hw  (pk  +  cVw(divwpk  -  Fk ))  . 


Above,  Uw  is  a  projection  operator  which  is  defined  as 


Uw(p(x,y)) 


{. p(x,y )  if  \p(x,y)\  <  1, 

sgn (p{x,y))  if  \p(x,  y)\  >  1, 


where  sgn  is  the  sign  function. 

The  subproblems  (68)  and  (69)  can  be  solved  by 


(73) 


ps(x)  =  min(G'fc(x)  +  -,(7s(x)); 

c 

Pt{x )  =  min  (Hk(x),Ct(x)). 

The  subproblems  (71)  and  (72)  can  be  solved  by 

ps{x)  =  min  (Gk(x)  +  —  ,Cs(x)); 

c 

Pt(x)  =  min (Hk(x)  -  —,Ct(x)). 

c 

Note  that  the  gradient  and  divergence  operators  in  the  algorithm  are  constructed  using  the  graphical 
framework,  as  shown  by  equations  (3)  and  (4). 


4.2  Convex  Method  (Version  lb):  Max-flow  With  Balancing  Constraints 
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This  section  demonstrates  how  to  incorporate  balancing  constraints  of  the  form  (28),  which  take 
into  account  the  size  of  the  classes.  This  lets  the  user  take  advantage  of  any  prior  knowledge  of 
the  size  of  the  two  classes.  Volume  constraints  have  been  incorporated  into  image  segmentation 
models  in  a  convex  framework  in  [69, 85].  We  propose  an  efficient  algorithm  for  incorporating  the 
hard  volume  constraint  (28)  on  graphs  by  slightly  modifying  the  dual  max-flow  problem  using  a 
new  variable  p  :  V  — »  M  as  follows: 


max  <P(ps,pt,P )  =  y^(pa(x)-ap)  \  (74) 

ps,Pt,P,P  l  Z '  ) 

xEV 

subject  to 

\p(x,y)\  <  C,  ( x,y)eE ;  (75) 

Ps{x)<Cs(x)t  VieT;  (76) 

Pt(x)  <  Ct(x),  Vi6k;  (77) 

divwp(x) -ps{x)  +pt(x)  +  p  =  0,  Vi6k;  (78) 

p  is  a  constant  function.  (79) 


Introducing  a  Lagrange  multiplier  A  for  the  constraint  (78)  yields  the  primal-dual  model 


min  max  \  E(ps,pt,p,  p-  A)  =  (ps\ 

A  ps,pt,p  L  L — 4 

xGV 

[x)  -  ap )  +  y^A(x)( cliv^ p-ps+pt  +  p)} 
xev 

(80) 

subject  to 

\p(x,y)\  <  C,  (x,y)eE; 

(81) 

Ps(x)<Cs(x),  VieT; 

(82) 

Pt(x)<Ct(x),  Vi6k; 

(83) 

p  is  a  constant  function. 

(84) 

Rearranging  the  terms,  we  obtain 

min  max  E(ps,  pt,  p,  p;  A) 

A  ps  ,pt  ,p,p  t 

=  ^2  { (!  -  A )ps  +  Xpt  +  p(A  -  a)  +  A  div.u,  p]  j 

(85) 

subject  to  (81)  -  (77)  and  (79). 

The  intuition  of  having  the  above  model  lies  in  the  following.  We  observe  that  if  A  ^  B',  the 
energy  can  be  arbitrarily  large  by  choosing  ps  or  p,  arbitrarily  small,  contradicting  boundedness 
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from  above.  From  the  second  to  last  term,  it  follows  that  if  the  balancing  constraint  (28)  is  not  sat¬ 
isfied,  the  energy  can  be  made  arbitrarily  large  by  chosing  p  arbitrarily  high  or  low.  Therefore,  by 
maximizing  for  ps ,  pt ,  p  and  p,  we  obtain  the  closed  form  expression  (33)  subject  to  the  constraints 
(34)  and  the  balancing  constraint  (28).  The  derivation  is  very  similar  to  the  one  in  Section  12.3.1. 

In  contrast  to  the  case  with  the  model  without  balancing  constraints,  it  cannot  be  guaranteed  in 
advance  that  a  global  minimizer  is  obtained  by  the  thresholding  procedure  described  in  Theorem  2. 
If  the  solution  is  binary,  it  must  also  be  a  global  minimizer  of  the  binary  constrained  problem,  since 
the  convex  set  B'  contains  the  binary  set  B.  In  the  experiments,  the  solution  tends  to  be  binary  or 
very  close  to  binary,  indicating  that  a  global  minimizer,  or  close  approximation,  can  be  obtained. 
We  again  construct  the  augmented  Lagrangian  functional 

Lc(Ps,Pt,P,  A)  =  Y  -ap  +  ps  +  A(divw p~ps  +  Pt  +  P )  -  ^  ||divwp  ~ps  +pt  +  p\\\  , 

x£V 

which  is  exactly  (4.1.2)  if  p  is  zero. 

The  goal  is  to  solve 

max  {  P(ps,Pt,P )  =  Y  ( Ps(x )  -  ap)  } 

ps,Pt,P,P  l  Z '  ) 

x€.V 

under  constraints  (81)  -  (79)  by  solving 

min  max  Lc{ps,pt,p,  X). 

A  ps,Pt,P,  A 

An  augmented  Lagrangian  method  can  be  applied  to  solve  this  problem  by  alternatively  maxi¬ 
mizing  Lc  for  the  dual  variables  ps,Pt  and  p  with  constraints  (81)-(79)  and  updating  the  Lagrange 
multiplier  A.  The  augmented  Lagrangian  method  for  (59)  thus  becomes: 


Convex  Method  (Version  lb)  Balancing  Constraints 


Initialize  p\ ,  p\ ,  p1  and  A1.  For  k  —  1, ...  until  convergence: 

•  Optimize  p  flow 

pk+1  =  arg  max  -  £  1 1  divu,  p  -  Fk  \  \ 2  ,  (86) 

l|p(e)||<CVee.E  4 

where  Fk  =  psk  —  pk  +  —  —  pk  is  fixed. 
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•  Optimize  source  flow  p . 


Pk+1  =  arg  max  Vps  -  £  || pa  -  Gk\\2  ,  (87) 

Ps(x)<Cs( x)Vx£V  xeV  * 

where  Gk  =  pk  +  divu,  pk+l  —  —  +  pk  is  fixed. 

•  Optimize  sink  flow  pt 

Pt+1  =  arg  max  -  £  \\pt  ~  Hk\\l ,  (88) 

pt{ x)<  Ct{x )  Vx£V  ^ 

where  Hk  =  pk+l  —  di  vwpk+1  +  ^  —  pk  is  fixed. 

•  Optimize  p 

Pk+1  =  arg  max  ^  ap  —  ^  ||p  —  Jk\\ (89) 

p  xev  2 

where  Jk  =  —pk+1  —  div^pfc+1  +  ^  +  pk+l  is  fixed. 

•  Update  A 

Xk+ 1  =  Xk  -  c  (di vwpk+1  -  pk+1  +  pk+1  +  pk+l) . 


The  optimization  problem  (86)  for  p  can  be  solved  by  one  step  of  the  projected  gradient  method 
as  follows: 

pk+l  =  Ilw(p  +  cVw(drvwpk  -  Fk )), 

where  11^  is  the  projection  defined  in  (73). 

The  subproblems  (87)  and  (88)  can  be  solved  by 

ps(x)  =  min(Gfc(x)  +  -,(7s(x)); 

c 

Pt(x )  =  min  (Hk(x),Ct(x)). 

We  solve  (89)  using 

pk+l  =  mean (-p/+1  +  pk+1  +  divu,  pk+1  +pk  -  —  (90) 

c  c 

In  step  (90),  the  constraint  that  p  should  be  constant  is  imposed  exactly  by  computing  the 
average  of  the  pointwise  unconstrained  maximizers  p(x)  for  x  G  V,  and  then  the  average  value  is 
assigned  to  p(x)  V  x  G  V. 

Just  like  in  the  previous  cases,  the  final  classification  is  obtained  by  thresholding  A. 
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4.3  Convex  Method  (Version  2) :  Extension  of  Primal  Augmented  Lagrangian 
Method  to  Graphs 

In  this  section,  we  describe  another  algorithm  for  solving  the  convex  problem  (32),  by  extending 
the  Split-Bregman  algorithm  [55]  for  geometric  problems  [54]  to  general  graphs.  It  has  recently 
been  shown  [95, 107]  that  the  Split-Bregman  algorithm  is  equivalent  to  solving  a  specialized  de¬ 
composition  of  total  variation  regularized  problems  by  the  augmented  Lagrangian  method.  We  use 
the  augmented  Lagrangian  notation  when  describing  the  algorithm,  since  this  notation  has  already 
been  introduced  in  Section  4.1  when  deriving  the  max-flow  algorithms. 

Consider  the  general  minimization  problem 

min  TVw(u)  +  'S~'  g(u(x),x).  (91) 

x£V 

In  our  case,  we  wish  to  choose  g  according  to  (31)  and  impose  the  constraint  u  €  B'.  The  idea 
is  to  solve  (91)  by  writing  this  unconstrained  problem  as  a  constrained  one  by  addition  an  extra 
variable.  This  results  in  the  problem 


min  |  g|  i  +  g(u(x),x) 

u.q  L ' 

(92) 

s.t.  q  =  VyjU, 

(93) 

where  Ws^  =  £xev  |s(x)|. 

The  idea  is  to  solve  this  by  the  augmented  Lagrangian  method.  We  introduce  a  Lagrange  mul¬ 
tiplier  <p  for  the  constraint  (93).  This  results  in  the  augmented  Lagrangian  functional 

Lc(u,q,<l>)  =  IMIi  +  Y  {g(u(x),x)  +  <f)-  ( q-Vwu )}  +  ^  \\q  -  V  wu\\\  ,  (94) 

xev 

where  c  is  a  constant  and  ||s||2  =  Is ix)  |2- 

The  idea  is  to  solve  (92)  with  constraint  (93)  by  finding  a  saddle  point  of  (94)  over  u,  q  and  (j>\ 

max  min  Lc(u,q,4>)- 

<j)  u,q 

This  one  can  achieve  by  alternating  between  minimizing  for  u  and  q 

C uk ,  qk)  =  arg  min  Lc(u,  q,  Xk)  (95) 

u,q 
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and  updating  the  Lagrange  multiplier  by  one  step  of  gradient  ascent: 

Xk+ 1  =  Xk  +  c(qk+l  —  Vuk+1). 

The  minimization  problem  (95)  can  be  separated  into  two  subproblems: 

min  V'  {g(u(x),x)  -  Xk  ■  Vwu\  +  x  \\q  -  Vwu\\22  ; 

U  * - *  Z 

xev 

min  \\q\\1  +  ^2Xk  ■  q  +  ^\\q-  Vwu\\l . 

x&V 

Therefore,  the  algorithm  is  the  following  (it  will  be  referred  to  in  the  text  as  version  2  of  the 
convex  method  or  the  primal  augmented  Lagrangian  algorithm): 

Convex  Method  (Version  2)  Primal  Augmented  Lagrangian  Algorithm 

Initialize  A1,  q 1  and  u 1 .  For  k  =  1, ...  until  convergence: 

•  Optimize  A 

uk+1  =  min  ^2  {g(u(x),x)  -  Xk  ■  Vwu}  +  C-\\qk  -  .  (96) 

u  x&V 

•  Optimize  q 

qk+l  =  min  H^lli  +  ^uk  -q  +  ^Wq-  Vwuk+1\\2  .  (97) 

q  L '  Z 

X&V 

•  Update  Lagrange  multipliers 

Xk+i  =  Xk  +  c^qk+ 1  _  Vvk+\y  (98) 


The  final  binary  classification  is  obtained  by  thresholding  u  to  either  0  or  1. 
The  subproblem  (96)  gives  the  Euler-Lagrange  equation: 

+  cdivw(qk  -  X7wu)  +  clivw(Xk )  =  0, 


where  in  this  case  =  C, 


Cs. 


du 

We  solve  the  above  subproblem  using  one  step  of  forward  Euler: 


uk+1  —  uk 
dt 


=  ~{Ct  -Cs  +  cdivw(qk  -  Vwuk)  +  divu,(Afc)). 
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This  becomes 


uk+1  —  uk 
dt 


=  -( Ct  -  Cs  +  cdivw(qk)  -  cAwuk )  +  divw(\k)). 


All  the  operators,  as  was  stated  before,  are  formulated  in  a  graph  setting,  as  shown  in  section  10. 
We  solve  subproblem  (97)  in  the  same  way  as  it  is  done  in  [107]: 


qk+1  =  | U1-  wkT\)b(x’y)’  if  \Kx,y)\  >  i, 

|o,  if  \b(x,y)\  <  1, 

where  b  =  \cVwu  —  Xk\. 

We  have  tried  solving  the  subproblem  (96)  above  in  another  way.  A  similar  scheme  is  used, 
except  the  Laplace  operator  is  calculated  implicitly,  and  we  proceed  further  by  considering  its 
terms  as  a  linear  combination  of  the  eigenvectors  of  the  random  walk  Laplacian,  in  a  similar  way 
as  in  [9]  and  [80].  Only  a  small  fraction  of  the  eigenvectors  are  used.  This  way  of  solving  the 
subproblem  turns  out  to  be  several  times  faster  than  the  one  previously  discussed,  but  because 
it  does  not  perform  well  in  the  case  of  non-random  fidelity,  we  did  not  use  it.  A  disadvantage 
of  this  method  is  that  it  only  uses  a  small  fraction  of  the  eigenvectors,  which  might  not  contain 
enough  information  to  result  in  an  accurate  classification,  as  was  the  case  with  experiments  with 
non-random  fidelity.  However,  it  also  has  some  advantages,  which  are  discussed  in  the  next  section. 


Avoiding  trivial  global  minimizers 

If  the  number  of  supervised  points  V*  U  Vb  are  very  low,  the  global  minimizer  of  (22)  may  just 
be  the  trivial  solution  S  =  V?  or  S  =  Vb.  This  was  the  case  for  a  small  number  of  our  experiments. 
In  order  to  avoid  this  problem,  the  cost  of  these  trivial  solutions  can  be  increased  by  increasing  the 
number  of  edges  incident  to  the  supervised  points  V f  U  Vb,  which  amounts  to  adding  nonlocal 
behavior.  Non-supervised  points  in  the  graph  are  connected  by  edges  to  their  M  nearest  neighbors. 
Supervised  points  can  instead  be  connected  to  their  K  nearest  neighbors,  where  K  >  M,  thereby 
increasing  the  cost  of  the  partitions  S  —  V*  and  S  =  Vb. 

An  interesting  observation  is  that  if  the  subproblem  (96)  in  the  primal  max-flow  algorithm  is 
solved  via  the  approximate  eigendecomposition,  the  algorithm  does  not  result  in  the  trivial  solution 
S  =  Vf  and  S  =  Vb,  even  when  the  number  of  supervised  points  are  very  low.  The  reason  for 
this  seems  to  be  that  the  approximation  resulting  from  not  using  all  the  eigenfunctions  erases  the 
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unwanted  trivial  global  minimizers  from  the  energy  landscape. 

Note  that  the  second  eigenvector  of  the  Laplacian  already  provides  a  solution  to  a  cut  using  a 
spectral  clustering  approximation  approach.  Although  we  experiment  with  using  that  approxima¬ 
tion  as  an  initialization,  the  methods  work  just  as  well  when  random  initialization  is  used. 

4.4  Results 

The  results  for  several  data  sets  are  summarized  in  Table  10,  with  those  of  the  best  method 
highlighted.  In  all  experiments,  we  have  constructed  the  graph  using  the  M  nearest  neighbors  and 
approximated  the  eigendecomposition  of  the  Laplacian  using  only  the  M  largest  eigenvalues.  By 
“random  fidelity”,  we  mean  choosing  supervised  points  randomly.  By  “corner  fidelity”,  we  mean 
choosing  supervised  points  in  a  certain  portion  of  the  data  set  only,  in  this  case  in  the  “corner” 
portion  of  the  eigenvector  graph.  The  technique  described  in  Section  4.3  for  avoiding  trivial  global 
minimizers,  was  used  on  the  two  moons  data  set  in  Fig. 22  in  case  of  less  than  3.25  %  supervised 
points.  The  results  were  computed  on  a  2.4  GHz  Intel  Core  i2  Quad. 

In  the  case  when  we  need  to  compute  the  eigenvectors  and  eigenvalues  of  the  random  walk 
graph  Laplacian,  we  first  use  a  fast  numerical  solver  called  the  Rayleigh-Chebyshev  procedure  [1] 
to  compute  those  of  the  symmetric  graph  Laplacian.  One  can  then  use  the  previously  described 
relationship  between  the  eigendecomposition  of  the  symmetric  graph  Laplacian  and  that  of  the 
random  walk  graph  Laplacian.  The  Rayleigh-Chebyshev  procedure  itself  is  a  modification  of  an 
inverse  subspace  iteration  method  using  adaptively  determined  Chebyshev  polynomials.  It  is  also 
a  robust  method  that  converges  rapidly  and  that  can  handle  cases  when  there  are  eigenvalues  of 
multiplicity  greater  than  one.  The  calculations  are  made  even  more  efficient  by  the  fact  that  only 
a  small  portion  of  the  eigenvectors  need  to  be  calculated,  as  the  most  significant  nodes  contain 
enough  information  to  produce  accurate  results.  To  have  a  fair  comparison,  we  use  the  same 
number  of  eigenvectors  per  data  set  for  all  methods. 

We  have  used 

f{X{x),x)  =  rj(x)\\(x)  -  A0(a;)|2  (99) 

for  all  our  computations.  Here,  Ao  is  the  initial  value  of  A,  and  r](x)  is  a  function  that  takes  on  a 
value  of  a  constant  r/  on  fidelity  points  and  zero  elsewhere. 

Below,  we  provide  more  detail  on  the  results  for  each  of  the  benchmark  data  sets,  as  well  as  a 
description  of  the  data  set  itself.  In  addition,  we  provide  a  comparison  of  the  results  to  those  of 
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some  of  the  best  methods,  including  the  binary  MBO  and  GL  algorithms.  The  binary  GL  algorithm 
is  the  multiclass  GL  algorithm  with  the  number  of  classes  equal  to  2. 

MNIST  Data  Set 

The  MNIST  digits  data  set  [72],  available  at  http://yann.lecun.com/exdb/mnist/,  is  a  data  set 
of  70000  28  x  28  images  of  handwritten  digits  from  0  —  9.  However,  since  our  method  is  only 
binary,  we  obtained  a  subset  of  this  set  to  classify,  in  particular,  digits  4  and  9  (since  these  digits 
are  sometimes  hard  to  distinguish,  if  handwritten).  This  created  a  set  of  13782  digits,  each  either 
4  or  9.  Starting  from  some  initial  classification  of  the  points  and  using  only  a  small  fraction  of  the 
set  as  fidelity,  the  goal  is  to  classify  each  image  into  being  either  a  4  or  9.  We  used  M  =  10. 

Using  random  initialization  and  random  fidelity,  the  max-flow  method  obtained  an  accuracy  of 
98.48%  averaged  over  100  runs  with  different  fidelity  sets  of  500  randomly  chosen  points  (or  only 
3.62%  of  the  set).  The  primal  augmented  Lagrangian  method’s  accuracy  was  around  98.44%.  The 
accuracy  of  binary  MBO  graph  method  from  [80]  and  the  binary  GL  graph  method  from  [9]  was 
slightly  lower  than  that  of  our  methods;  the  algorithms  were  able  to  achieve  an  average  accuracy 
of  98.37%  and  98.29%,  respectively.  Table  10  summarizes  the  above  and  also  shows  results  in 
the  case  when  the  initialization  is  constructed  using  the  thresholded  second  eigenvector  of  the 
Laplacian  or  when  the  fidelity  region  is  chosen  nonrandomly  by  only  considering  points  that  give 
values  in  the  corners  of  leftmost  image  in  Figure  24a.  More  detail  on  the  latter  information  will  be 
given  in  Section  4.4.  The  parameters  for  Convex  Method  (Version  1)  were:  c  =  0.5,  rj  =  50.  For 
Convex  Method  (Version  2),  they  were:  c  =  0.008,  r/  =  400,  dt  =  0.032. 

To  compare  with  other  methods,  we  note  a  recent  result  by  Hu  et  al.  [66],  which  is  an  unsu¬ 
pervised  method.  The  authors  of  the  paper  also  tested  only  digits  4  and  9  and  obtained  a  purity 
score  (measures  the  fraction  of  the  nodes  that  have  been  assigned  to  the  correct  community)  of 
0.977.  The  GenLouvain  algorithm  obtained  a  purity  score  of  0.975.  In  addition,  many  other  al¬ 
gorithms  have  used  the  full  MNIST  data  set  with  all  10  digits.  For  example,  Cheeger  cuts  [100], 
boosted  stumps  [68,72],  and  transductive  classification  [101]  have  obtained  accuracies  of  88.2%, 
92.3%-98.74%,  and  92.6%,  respectively.  Also,  papers  on  Axnearest  neighbors  [71,72],  neural  or 
convolutional  nets  [25,71,72],  nonlinear  classifiers  [71,72]  and  SVM  [29,71]  report  accuracies  of 
95.0%-97.17%,  95.3%-99.65%,  96.4%-96.7%,  and  98.6%-99.32%,  respectively.  The  aforemen¬ 
tioned  approaches  are  supervised  learning  methods  using  60,  000  out  of  70,  000  digits  (or  about 
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85.71%  of  the  whole  data  set)  as  a  training  set.  Morever,  we  compare  our  method  with  [13],  which 
obtains  98.05%  accuracy  by  knowing  10%  of  the  labels,  97.78%  by  knowing  5%  of  the  labels,  and 
97.72%  by  knowing  2.5%  of  the  labels.  Our  algorithms,  taking  only  3.6%  of  the  data  as  fidelity, 
obtain  around  98.5%  accuracy,  and  thus  are  competitive  with  or  outperform,  these  methods.  More¬ 
over,  we  have  not  performed  any  preprocessing  or  initial  feature  extraction  on  the  data  set,  unlike 
most  of  the  mentioned  algorithms. 

Banknote  Authentication  Data  Set 

The  banknote  authentication  data  set,  from  the  UCI  machine  learning  repository  [4],  is  a  data 
set  of  1372  features  extracted  from  images  (400  x  400  pixels)  of  genuine  and  forged  banknotes. 
Wavelet  transform  was  used  to  extract  the  features  from  the  images.  The  goal  is  to  segment  the 
banknotes  into  being  either  genuine  or  forged.  We  used  M  =  15. 

The  results  are  shown  in  Table  10.  With  the  max-flow  method,  for  a  5.1%  fidelity  set,  we  were 
able  to  obtain  an  average  accuracy  (over  100  different  fidelity  sets)  of  around  99.09%,  while  the 
primal  augmented  Lagrangian  method  achieved  a  similar  accuracy  of  98.75%.  The  results  did  not 
deteriorate  much  for  a  smaller  fidelity  set  of  3.6%,  with  the  two  methods  achieving  an  accuracy  of 
98.83%  and  98.29%,  respectively.  The  parameters  for  Algorithm  1  were:  c  =  0.15,  //  =  250.  For 
Algorithm  2,  they  were:  c  =  0.08,  rj  =  50,  dt  =  0.5. 

We  compare  this  result  to  the  binary  MBO  algorithm,  which  achieved  a  lower  accuracy  of 
95.43%  and  93.48%  for  5.1%  and  3.6%  fidelity  sets,  respectively.  For  the  binary  GL  method,  the 
results  were  also  not  as  good-  97.76%  and  96.10%,  for  5.1%  and  3.6%  fidelity  sets,  respectively. 

Two  Moons  Data  Set 


Figure  22:  Two  moons  example  with  max-flow  method 

This  way  the  data  set  is  constructed  was  described  in  Part  1.  We  used  M  =  10. 
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For  the  max-flow  method,  in  the  case  of  65  or  lower  number  of  fidelity  points  (3.25  %),  we 
increased  the  number  of  edges  of  supervised  points  to  others  to  avoid  the  trivial  global  minimizer 
where  all  points  but  the  supervised  ones  are  classified  as  one  class. 

Using  random  initialization  and  random  fidelity,  for  the  max-flow  method,  we  obtained  an 
average  accuracy  of  97.10%  and  97.05%  in  the  case  of  100  and  50  fidelity  points,  resp..  An 
example  of  a  solution  is  shown  in  Figure  22,  with  the  two  classes  colored  in  red  and  blue.  The 
primal  augmented  Lagrangian  method  achieved  an  accuracy  of  around  97.07%  for  100  fidelity 
points  and  around  96.78%  for  50  fidelity  points.  The  parameters  for  Convex  Method  (Version  1) 
were:  c  =  0.5,  rj  =  50.  For  Version  2,  they  were:  c  =  0.32,  r)  =  100,  dt  =  0.008. 

To  compare  this  with  binary  MBO,  the  method  obtained  98.41%  and  97.53%  accuracy  for  100 
and  50  fidelity  points,  resp.,  which  is  very  similar  to  the  results  of  the  binary  GL  graph  method. 

Rods  Data  Set 

We  have  also  tested  this  algorithm  on  two  other  synthetic  data  sets  created  using  the  rods  pic¬ 
tured  in  Figure  23.  Around  2000  uniformly  chosen  points  were  sampled  from  each  image,  and  then 
embedded  in  M100.  Finally,  noise  was  added  to  each  of  the  points.  We  used  M  =  25. 

In  the  case  of  random  fidelity  region,  we  obtained  accuracy  in  the  98th  or  99th  percentile,  no 
matter  what  initialization.  In  the  case  of  fidelity  region  in  the  comer,  we  obtained  interesting 
global  minimizers  for  the  two  data  sets.  The  comparison  of  our  results  with  the  binary  MBO  and 
GL  method  is  detailed  in  the  next  section.  The  parameters  for  Algorithm  1  were:  c  =  0.01,  rj  =  50. 
For  Algorithm  2,  they  were:  c  =  0.016,  rj  =  500,  dt  =  0.512. 

Comparison  of  Convex  Method  Versions  1  and  lb 

We  tested  our  balancing  constraints  max-flow  method  on  several  data  sets  using  random  ini¬ 
tialization  and  fidelity,  and  compared  it  to  the  regular  max-flow  meethod.  It  handles  the  case  of 
a  small  fidelity  region  better  than  the  original  max-flow  method  (Algorithm  1)  and  gives  higher 
accuracy  everywhere.  We  used  the  same  M  as  described  in  the  previous  section.  The  results  are 
displayed  in  Table  9,  and  those  of  the  best  method  (compared  to  Algorithm  1)  are  highlighted.  In 
general,  the  solution  is  very  close  to  binary,  with  some  small  differences  that  may  be  explained  by 
the  finite  stopping  criterion  of  the  algorithm,  indicating  that  close  approximations  to  global  mini- 
mizers  are  obtained.  To  measure  how  close  the  solution  A  is  to  binary,  we  have  computed  the  norm 
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Table  9:  Comparison  of  Convex  Method  Versions  1  and  lb 


Number  of  fidelity  points 

50 

40 

30 

20 

Two  moons-  Max-flow  method  (regular) 

97.05% 

96.92% 

96.86% 

88.22% 

Two  moons-  Max-flow  method  (balancing  constraints) 

97.19% 

97.12% 

97.11% 

96.11% 

Number  of  fidelity  points 

500 

400 

300 

210 

MNIST-  Max-flow  method  (regular) 

98.44% 

98.40% 

98.36% 

93.68% 

MNIST-  Max-flow  method  (balancing  constraints) 

98.59% 

98.48% 

98.45% 

98.41% 

Number  of  fidelity  points 

50 

40 

30 

20 

Banknote  Authentication  Data  Set  (regular) 

98.83% 

98.72% 

98.21% 

96.78% 

Banknote  Authentication  Data  Set  (balancing  constraints) 

98.83% 

98.91% 

98.72% 

98.55% 

XLev  |AW|yt  where  is  defined  in  (2)  and  i  =  0.5.  The  norm  should  be  0  if  A  is  binary.  In 
the  experiments,  the  values  of  the  norm  range  from  0.0005  to  6  *  HP18. 

For  the  two  moons  example,  starting  from  20  fidelity  points,  we  obtained  good  results.  For  50, 
40,  30  and  20  fidelity  points,  we  obtained  97.19%,  97.12%,  97.11%  and  96.11%  average  accu¬ 
racy  (over  100  different  fidelity  sets),  respectively.  While  the  results  of  the  binary  MBO  method 
(without  any  zero  means  constraint)  for  the  two  moons  data  set  achieves  slightly  better  accuracy 
for  50  and  100  fidelity  points  (being  of  97.53%  and  98.41%,  respectively),  we  noticed  that  if  the 
number  of  fidelity  points  is  too  low,  the  method  is  unable  to  perform  well,  as  the  results  depending 
on  the  fidelity  set.  The  same  is  the  case  with  the  max-flow  method  with  no  balancing  constraint. 
For  example,  for  20  fidelity  points,  the  average  accuracy  we  obtained  for  the  method  was  88.22%. 
However,  with  the  balanced  method,  we  still  obtain  a  good  result  (96.11%)  for  a  fidelity  set  con¬ 
taining  as  little  as  20  points.  Thus,  the  advantage  of  the  method  is  that  it  performs  well  with  even 
a  very  small  fidelity  region.  The  results  are  summarized  in  Table  9. 

For  the  MNIST  data  set,  we  obtained  very  good  results  even  for  a  small  number  of  fidelity 
points.  For  500,  400,  300  and  210  fidelity  points  (or  3.6%,  2.9%,  2.2%  and  1.5%  of  the  data, 
respectively),  we  obtained  an  average  accuracy  (over  100  different  fidelity  sets)  of  98.59%,  98.48%, 
98.45%  and  98.41%,  resp..  A  comparison  to  the  results  of  the  regular  max-flow  method  is  in  Table 
9.  Note  that  in  addition  to  giving  at  least  a  slightly  higher  accuracy  everywhere,  it  handles  the  case 
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Table  10:  Comparison  of  MBO  and  Convex  Methods 


max-flow 

convex 

binary 

binary 

method 

MBO 

GL 

MNIST  (3.6%  fidelity)  random  initialization,  random  fidelity 

98.48% 

98.44% 

98.37% 

98.29% 

MNIST  (3.6%  fidelity)  2nd  eigenvector  initialization,  random  fidelity 

98.48% 

98.43% 

98.36% 

98.25% 

MNIST  (3.6%  fidelity)  random  initialization,  corner  fidelity 

98.47% 

98.40% 

62.35% 

64.39% 

MNIST  (3.6%  fidelity)  2nd  eigenvector  initialization,  corner  fidelity 

98.46% 

98.40% 

63.87% 

63.19% 

Banknote  Authentication  Data  Set  (5.1%  fidelity) 

99.09% 

98.75% 

95.43% 

97.76% 

Banknote  Authentication  Data  Set  (3.6%  fidelity) 

98.83% 

98.29% 

93.48% 

96.10% 

two  moons  (5%  fidelity) 

97.10% 

97.07% 

98.41% 

98.31% 

two  moons  (2.5%  fidelity) 

97.05% 

96.78% 

97.53% 

98.15% 

of  a  small  number  of  fidelity  points  better  than  the  original  method.  For  example,  for  210  fidelity 
points,  the  accuracy  is  98.41%  compared  to  93.68%  of  the  original  method. 

For  the  banknote  authentication  data  set  from  the  UCI  Machine  Learning  Repository  [4],  we 
obtained  reasonable  results  for  as  little  as  20  fidelity  points.  The  results  are  shown  in  Table  9.  The 
balancing  constraints  method  obtains  better  results  than  the  original  max-flow  method,  achieving 
an  average  accuracy  (over  100  different  fidelity  sets)  of  98.55%  for  only  20  fidelity  points,  as 
opposed  to  the  accuracy  of  96.78%  of  the  original  max-flow  method. 

For  the  first  rod  data  set,  we  obtained  reasonable  results  starting  from  around  10  fidelity  points 
out  of  around  two  thousand  that  are  in  the  rods  data  set.  For  10  to  20  fidelity  points,  the  accuracy 
was  around  around  96%.  Testing  50  and  100  fidelity  points,  we  obtained  around  99%  accuracy. 

Comparison  of  Our  Convex  Algorithms  to  Binary  MBO  and  GL  Methods 

After  comparing  the  results  of  our  convex  algorithms  to  the  binary  MBO  [80]  and  binary  GL  [9] 
graph  methods,  we  have  reached  the  following  conclusions  based  on  our  work: 

•  As  long  as  the  fidelity  points  are  well  represented  for  each  class  (meaning  the  fidelity  points 
represent  a  whole  variety  of  points  in  the  class),  the  binary  MBO  method  and  the  binary 
GL  method  have  no  trouble  finding  the  correct  minimizer  or  something  very  close.  The 
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Figure  23:  Rod  1  and  Rod  2  Data  Sets 


initialization  might  not  matter;  even  with  a  bad  initialization,  the  local  minimizer  will  still  be 
found.  Our  convex  methods  find  the  local  minimizer  easily. 

•  Problems  occur  when  the  fidelity  is  not  chosen  randomly.  In  this  case,  even  if  the  initialization 
is  random,  the  convergence  might  not  occur  for  the  binary  MBO  and  GL  methods.  In  all  our 
experiments  with  the  rods  data  sets  and  MNIST,  the  local  minimizer  was  not  found  by  the 
two  methods.  However,  our  convex  algorithms  still  found  the  correct  local  minimizer. 

These  conclusions  are  supported  by  the  work  done  on  the  MNIST  digits  data  set,  using  digits 
4  and  9  only.  The  second  vs.  third  eigenvector  of  the  symmetric  graph  Laplacian  are  graphed 
in  Figure  24a,  with  one  digit  represented  by  blue  and  another  by  red.  The  results  of  experiments 
on  this  data  set  are  found  in  Figure  24.  Each  row  represents  a  different  experiment:  first  two 
rows  contain  experiments  with  random  initialization,  while  last  two  rows  contain  experiments  with 
fidelity  in  a  constrained  area.  The  initialization  is  random  for  experiments  in  first  and  third  row, 
and  is  constructed  by  thresholding  the  second  eigenvector  of  the  Laplacian  for  the  results  in  the 
second  and  fourth  row.  The  first  column  represents  the  initialization,  while  the  second  and  third 
columns  are  results  for  the  max-flow  and  binary  MBO  algorithm,  respectively.  The  fidelity  points 
are  marked  by  yellow  and  magenta  for  the  two  classes. 

We  see  that  if  the  fidelity  region  is  well  represented  in  the  data  set,  no  matter  what  initialization, 
none  of  the  algorithms  have  trouble  finding  a  close  to  perfect  solution  (accuracy  is  between  98% 
and  99%-  see  Table  10).  However,  when  the  fidelity  region  is  not  random  (in  this  case  constrained 
to  only  the  nodes  whose  corresponding  entry  in  the  second  or  third  eigenvector  of  the  Laplacian 
match  a  certain  range),  we  see  that  the  binary  MBO  and  GL  algorithms  fail  to  obtain  an  accurate 
solution;  its  accuracy  is  below  70%.  However,  the  max-flow  algorithm  and  the  primal  augmented 
Lagrangian  methods  achieve  an  almost  perfect  solution  of  98.47%  and  98.40%  accuracy,  resp. 
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Third  Eigenvector  Third  Eigenvector  Third  Eigenvector  Third  Eigenvector 
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Second  Eigenvector 
(a)  Ground  Truth 


Second  Eigenvector 
(b)  Random  initialization  and  fidelity 


Second  Eigenvector 
(c)  max-flow  result 


Second  Eigenvector 
(e)  2nd  eigenvector  init.,  random  fidelity 


Second  Eigenvector 
(f)  binary  MBO  result 


Second  Eigenvector 
(h)  Random  initial.,  corner  fidelity 


Second  Eigenvector 
(i)  max-flow  result 


Second  Eigenvector 
(k)  2nd  eigenvector  init.,  corner  fidelity 


Second  Eigenvector 
(1)  max-flow  result 


Second  Eigenvector 
(m)  binary  MBO  result 


Second  Eigenvector 
(d)  binary  MBO  result 


Second  Eigenvector 
(g)  binary  MBO  result 


Second  Eigenvector 
(j)  binary  MBO  result 


Figure  24:  MNIST  Data  set  Results.  Left:  initialization,  supervised  points  are  marked  in  yellow  and  magenta.  Middle: 
max-flow  algorithm  result.  Right:  binary  MBO  result 


The  two  conclusions  are  also  supported  by  the  work  done  on  the  two  rod  data  sets  created  from 
images  displayed  in  Figure  23.  Experiments  with  the  first  and  second  rod  image  are  shown  in 
Figures  25  and  26,  respectively.  The  first  two  rows  represent  cases  with  a  random  fidelity  set, 
while  the  last  two  rows  are  experiments  with  corner  fidelity.  Cases  with  random  initialization  are 
in  first  and  third  rows,  and  second  eigenvector  initialization  is  used  in  experiments  in  the  second 
and  fourth  row.  The  first  column  is  the  initialization,  second  column  is  the  max-flow  algorithm 
result,  and  third  column  is  the  binary  MBO  result. 

For  rod  1 ,  we  found  that  when  the  fidelity  is  chosen  randomly,  the  minimizer  divides  the  bot¬ 
tom  two  rods  from  the  rest  of  the  image.  When  the  fidelity  is  chosen  at  the  corners,  the  minimizer 
is  shown  in  the  bottom  two  rows  of  the  second  column.  The  convex  max-flow  and  primal  aug¬ 
mented  Fagrangian  algorithms  are  able  to  attain  these  minimizers,  while  the  binary  MBO  and  GF 
algorithms  struggle  in  the  case  of  non-random  fidelity.  The  situation  is  similar  with  rod  2. 

Note  about  MBO  Algorithm 

As  noted  in  [80],  the  first  step  of  the  MBO  algorithm  (heat  equation  with  an  extra  term)  was 
solved  using  the  eigenvalue  and  eigenvector  decomposition  of  the  Faplacian.  However,  a  disad¬ 
vantage  of  solving  equations  using  this  method  is  that,  for  it  to  be  efficient,  only  a  fraction  of 
eigenvectors  are  used,  and  they  might  not  contain  enough  information  to  result  in  an  accurate 
classification.  Naturally,  the  more  eigenvectors  one  computes,  the  longer  the  process  will  take. 

As  an  alternative  way  of  solving  the  first  step  (30)  of  the  MBO  algorithm,  we  have  tried  using 
just  the  simple  forward  heat  equation  solver.  However,  this  did  not  result  in  an  accurate  segmen¬ 
tation  in  the  case  of  non-random  fidelity,  thus  not  improving  the  results  from  the  original  way  of 
solving  it.  This  shows  that  the  algorithm  is  getting  stuck  in  a  local  minimum,  since  the  problem  is 
clearly  not  the  lack  of  information  encoded  within  the  small  number  of  eigenvectors  used. 

Comparison  of  Convergence,  Speed,  and  Energy 

The  stopping  criterion  used  for  all  algorithms  was  taken  to  be  the  point  at  which  the  square 
of  the  relative  F2  norm  between  the  current  and  previous  iterate  is  negligible,  or  below  a  certain 
constant  a.  With  the  exception  of  the  MNIST  data  set  (where  a  —  5  *  le  —  10),  the  max-flow, 
binary  MBO  and  GF  algorithms  stabilize  around  a  =  le  —  17  or  le  —  16.  The  primal  augmented 
Fagrangian  method  stabilizes  around  a  =  le  —  08  or  le  —  09. 
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(a)  Random  initialization  and  fidelity 


(b)  max-flow  result 


(d)  2nd  eigenvector  init.,  random  fidelity 


(e)  binary  MBO  result 


(g)  Random  initial.,  corner  fidelity 


(h)  max-flow  result 


(j)  2nd  eigenvector  init.,  comer  fidelity 


(k)  max-flow  result 


(c)  binary  MBO  result 


(f)  binary  MBO  result 


(i)  binary  MBO  result 


(1)  binary  MBO  result 


Figure  25:  Rod  1  Data  Set  Results.  Left:  initialization,  supervised  points  are  marked  in  yellow  and  green.  Middle: 
max-flow  algorithm  result.  Right:  binary  MBO  result 
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(d)  2nd  eigenvector  init.,  random  fidelity 


(e)  binary  MBO  result 


(f)  binary  MBO  result 


(g)  Random  initial.,  corner  fidelity 


(h)  max-flow  result 


(i)  binary  MBO  result 


(j)  2nd  eigenvector  init.,  comer  fidelity 


(k)  max-flow  result 


(1)  binary  MBO  result 


Figure  26:  Rod  2  Data  Set  Results.  Left:  initialization,  supervised  points  are  marked  in  yellow  and  green.  Middle: 
max-flow  algorithm  result.  Right:  binary  MBO  result 


Table  1 1 :  Number  of  Iterations  of  MBO  and  Convex  Methods  and  Runtime  Comparison 


Number  of  iterations 

max-flow 

primal  augmented 

binary  MBO 

binary  GL 

Lagrangian 

MNIST 

426 

2709 

10 

52 

Banknote  Authentication  Data  Set 

314 

725 

7 

449 

two  moons 

1031 

451 

8 

108 

Timing  (s) 

max-flow 

primal  augmented 

binary  MBO 

binary  GL 

Lagrangian 

MNIST  " 

2.88 

43.21 

0.52 

0.78 

Banknote  Authentication  Data  Set 

1.21 

3.76 

0.90 

0.95 

two  moons 

4.13 

5.23 

2.30 

2.98 

aThis  is  the  timing  of  the  method  using  already  computed  weights  and  eigenvalues/eigenvectors  of  the  random  walk 
Laplacian. 

Table  12:  Comparison  of  Final  Energy  of  the  MBO  and  Convex  Methods 


Data  Set 

initial  energy 

max-flow 

final  energy 

primal  augmented 

Lagrangian 

final  energy 

binary  MBO 

final  energy 

binary  GL 

final  energy 

MNIST  (random  fid) 

23223 

789 

789 

798 

804 

MNIST  (non-random  fid) 

23223 

791 

792 

2167 

5363 

Banknote  Authentication 

3308 

30 

37 

51 

42 

two  moons 

3802 

533 

535 

538 

548 

rod  1  (random  fid) 

4159 

146 

148 

163 

159 

rod  1  (non-random  fid) 

4159 

88 

89 

825 

391 

rod  2  (random  fid) 

4528 

171 

176 

186 

184 

rod  2  (non-random  fid) 

4528 

101 

105 

709 

421 
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Table  1 1  includes  information  about  the  number  of  iterations  needed  to  reach  stability,  and  also 
the  timing  results  for  each  data  set. 

We  have  also  computed  the  initial  and  final  energy  for  each  data  set.  The  energy  was  calculated 
using 

E(\)  =  w(x,y)Hx)  ~u(y) I >  (ioo) 

x,y  GV 

where  A  (a;)  is  0  if  node  x  was  classified  to  be  in  the  first  class,  and  1  if  it  was  classified  to  be  in  the 
second  class.  Note  that  the  energy  here  is  exactly  TVW( A).  Table  12  includes  information  about  the 
initial  and  final  energy  for  each  method.  We  see  that  the  max-flow  algorithm  is  able  to  obtain  the 
lowest  energy  in  each  case.  In  general,  one  can  see  that  the  convex  algorithms  are  able  to  obtain 
the  global  minimizer  in  all  cases,  while  the  binary  MBO  and  GL  algorithms  struggle  in  the  case  of 
non-random  fidelity. 
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CHAPTER  5 


Modified  Cheeger  Method 


5.1  Derivation  of  the  Method 


We  consider  the  binary  partitioning  problem  of  the  target  set  X.  The  ratio  cut  problem  can  be 
written  as  minimizing 

over  all  sets  S,  where  S  is  a  subset  of  X.  We  have  X  =  S  U  S. 

Now  let  u  be  a  binary  function  (taking  values  0  or  1)  denoting  the  class  of  each  of  the  nodes. 
Then  one  can  write  the  ratio  cut  problem  as  the  minimization  of 


mm 

u:«(x)£{0,l} 


2  : 

x,y 


\U\X) 


u 


(y)l) 


+ 


Exu(x)  n-]Txu(x) 


over  all  binary  functions  (with  values  0  or  1). 

Similar  logic  is  used  in  [12]  which  states  that  the  problem  of  minimizing 

cut(S,  S) 
minds'),  \S\) 

over  all  partitions  X  =  S  U  S  reduces  to  minimizing 

%Y,x,yw(x’y)\u(x)  -  u(y) I 

J2X  \u{x)  -  med(u)  j 

over  all  binary  functions  u  (with  values  0  or  1),  where  med(u)  is  the  median  of  u. 

Using  the  same  notation,  one  can  write  the  following  formulas  for  the  total  variation  and 
Ginzburg-Landau  (GL)  functionals.  These  definitions  were  used  in  previous  papers  by  Yves  van 
Gennip  et  al.  in  [103]: 

TVw{u)  =  ~^2w(x,y)\u(x)  -u(y) |, 

x,y 

GLe(u)  =  ~^w(x,y){u{x )  -  u(y))2  +  ^  w(u(x)). 

x,y  x 
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The  justification  of  why  there  is  no  e  in  the  first  term  of  Ginzburg-Landau  function  is  justified 
in  [103].  Also,  note  that  the  cut  can  be  exactly  interpreted  as  total  variation.  In  other  words, 
minimizing  the  cut  is  the  same  as  minimizing  the  total  variation  of  the  binary  function  (indicating 
the  class)  that  takes  values  0  or  1  on  the  nodes. 

^  n(n—  1) 

Assume  n  is  the  number  of  vertices  in  the  graph  and  let  V  =  R"  and  £  —  R  2  be  Hilbert 
spaces  defined  via  the  following  inner  products: 


(A,  l)v  =  Hxh{x)d{x), 


=  ~^2il>{x,y)<i>(x,y)w(x,y). 


x,y 


Here,  di  represents  the  degree  of  node  i. 
Let  us  also  define  the  following  norms: 


=  \/(A,A)v  =  \J^2\(x)2d(x), 


M\e  =  \/WWs  = 

V  X’V 

Then,  one  can  rewrite  the  GL  functional  using  the  inner  product  notation: 

GL,(u)  =  \\XJu\\l+1-{D-rW(u),l)v, 

where  D  is  the  degree  diagonal  matrix  and  (D~lW {u)){x)  =  d(x)~lW{ui).  The  factor  d(a;)_1  is 
needed  to  cancel  the  factor  d(x)  in  the  V-  inner  product. 

The  work  [103]  has  the  following  theorem: 

p 

Theorem  3.  GLe  — »  GL0  as  e  — »  0  where 


GL0(X) 


TVw{u )  for  u  s.t.  Ui  G  (0, 1}, 
00  otherwise. 


Therefore,  one  can  interchange  total  variation  (a.k.a.  cut)  with  the  Ginzburg-Landau  functional. 
Replacing  the  TV  term  (the  first  term)  in  (5.1)  by  the  GL  functional,  we  obtain  the  problem  of 
minimizing 


F{u)  =  GL£(u ) 


+ 


n-Exu(x) 
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We  denote  the  second  term  in  the  product  by  C\  (it). 

The  modified  Allen-Cahn  equation  can  then  be  derived  using  the  V-gradient  flow  associated 
with  GL  functional.  Since 

+  tv)  =  GLe(u  +  tv))(C  i(u  +  tv))  +  GLe{u  +  tv)(^Ci(u  +  tv)), 

evaluation  at  zero  gives 

-2(Cl(u)Au,v)v  +  -{C1{u)D~lW'{u),v)v  +  ( GL{u)D~1C2(u),v)v , 
e 

where  D  is  the  diagonal  matrix  containing  the  degree  of  the  node  in  its  diagonal  places, 

„  ,  ,  1  1 

C\(u)  —  ^  + 


and 


Coin)  = 


Exu(x)  n~Exu(x) 

-l  i 


(Exu(x))2  {n-J2,-u(x))2' 

The  modified  Allen-Cahn  equation  is  thus  the  following: 


du(x )  =  2C1(u)(Au)(x)  -  Ci(u)W'(u(x))  GLe(u)C2(u ) 


dt 


ed(x) 


d(x) 


or  equivalently 

du(x 


=  2c\(u)(Au)(x) - - giMTAW) _ 


dt  — /V-/ 
where  rf(x)  is  the  degree  of  node  x. 


ed{x) 


ed(x) 


5.2  The  Algorithm 

Discretizing  (101),  calculating  the  Laplacian  implicitly,  one  obtains 


u?+1  -  un 


„+iwA„,„+1,  C2(«.")||VM"||2  CiWgK)  (E.  gww 


dt 


=  2Ci (un+L)(Aun+l)i  - 


di 


edj 


edi 


(102) 


To  solve  the  above  equation,  we  use  the  eigendecomposition  of  a  Laplacian.  Let 

un  =  ^2  akMx ) 

k 

C2(u)\\Vu\\2  Ci(u)W'(ui)  (EiW(«i))C2(u). 


edi 


edi 
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where  (j>{x)  are  the  eigenfunctions  of  a  Laplacian.  After  plugging  in  these  repsentations  into  (102), 
one  obtains 


a 


,n+ 1 


a 


+  dtbl 


where  Xk  is  the  kth  eigenvalue 


1  +  2dtC\{un)  X 


k 


Therefore,  the  algorithm  consists  of  the  following  five  steps: 

*  Initialize  u°. 

*  Repeat  the  following  steps  from  n  —  0  until  a  stopping  criterion  is  satisfied: 

-  Calculate  C\  =  Ci(un),  C2  =  C2(un). 

-  Calculate  an+1  using  (5.2). 

-  Calculate  bn+1  using  the  transform  of  un. 

-  Calculate  un+l  using  the  inverse  transform  of  an+1. 

*  Threshold  the  solution  to  obtain  a  binary  answer. 


One  can  derive  a  similar  algorithm  using  a  slightly  different  problem  of  minimizing 

cut(S,  S)  (  *  +  ) 

\vol{b)  vol(b )/ 

over  all  sets  S,  where  S'  is  a  subset  of  X.  We  have  X  —  S  U  S.  Now,  C\  and  C2  would  be  the 
following: 

2  2 

Cl(")  =  +  E^W(1-“W)’ 

-2  2 

°2(u)  =  + 

In  addition,  (102)  will  be  changed  to  a  similar  one  but  in  which  the  normalization  by  the  di  from 
the  second  and  fourth  terms  in  the  right  hand  side  of  the  equation. 

However,  the  results  of  the  two  algorithms  do  not  differ  by  much,  and  thus  we  only  state  the 
results  for  the  original  algorithm  in  the  results  section. 

Moreover,  we  are  considering  developing  an  MBO  scheme  for  this  problem,  in  a  similar  way  as 
was  done  in  Chapter  3.  However,  this  is  not  trivial  due  to  the  difficulties  encountered  with  the  new 
thresholding  step  because  of  the  presence  of  the  normalization  factor  in  the  problem  formulation. 
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5.3  General  Laplacian  Framework 


Similar  to  the  procedure  in  [50],  we  proceed  with  testing  different  kinds  of  Laplacians.  To 
derive  some  new  Laplacians,  we  focus  on  the  case  of  a  dynamic  process.  In  a  dynamic  process, 
each  node  is  associated  with  a  dynamic  variable  which  is  allowed  to  change.  The  variables 
evolve  according  to  a  dynamic  process.  In  [50],  the  process  considered  was 

-  =  -ce 

dt 

where  0  is  a  matrix  containing  the  9t  entries.  The  matrix  C  is  called  the  spreading  operator. 

We  consider  the  more  general  framework  for  the  Laplacian: 

C  =  T“2£)-3(D  - 


In  this  representation,  T  is  an  n  x  n  diagonal  matrix  of  nodel  delay  factors.  It’s  ith  diagonal  element 
represents  the  average  delay  of  node  i.  We  also  consider  a  generalization  Z  of  the  traditional  weight 
matrix.  Z  can  be  any  symmetric  positive-  definite  matrix.  We  want  to  study  how  the  results  change 
when  Z  and  T  vary.  We  consider  four  cases: 

Normalized  Laplacian  In  the  case  when  Z  =  W  and  T  =  /,  we  obtain  the  normalized 
symmetric  Laplacian: 

Csym  =  D^(D-W)D~Z 

Scaled  Graph  Laplacian  In  the  case  when  Z  =  W ,  T  =  dmaxD~l  (here,  drnax  represents 
the  maximum  degree  in  the  degree  matrix  associated  with  Z),  the  spreading  operator  is  called  the 
scaled  graph  Laplacian: 

£Sci  =  -t—{d  -  w). 

& max 

Replicator  If  we  let  Z  =  VWV,  where  V  is  a  diagonal  matrix  whose  elements  are  the  compo¬ 
nents  of  the  eigenvector  corresponding  to  the  largest  eigenvalue  of  W,  and  T  =  I,  we  obtain 

£rep  -  I- 

A max 

where  Xmax  is  the  largest  eigenvalue  of  W. 

Unbiased  Adjacency  Matrix  If  we  let  Z  =  D~^WD~^  and  T  =  dmaxD^x  (here,  dmax 
represents  the  maximum  degree  in  the  degree  matrix  associated  with  Z),  we  obtain  the  unbiased 
adjacency  matrix: 

£u„b  =  -l—(D-D-sWD-*). 
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Table  13:  Modified  Cheeger  Method  Results 


Data  Set 

symmetric 

scaled  graph 

replicator 

unbiased  adjacency 

Laplacian 

Laplacian 

matrix 

MNIST 

98.27% 

98.12% 

94.38% 

98.26% 

two  moons 

97.8% 

97.45% 

94.85% 

97.55% 

Table  14:  Modified  Cheeger 


MNIST 


Method 

Accuracy 

symmetric  Laplacian 

98.27% 

scaled  graph  Laplacian 

98.12% 

replicator 

94.34% 

unbiased  adjacency  matrix 

98.26% 

method  in  [12] 

98.36  % 

Method  Results  and  Comparison 


Two  moons 


Method 

Accuracy 

symmetric  Laplacian 

97.8% 

scaled  graph  Laplacian 

97.45% 

replicator 

94.85% 

unbiased  adjacency  matrix 

97.55% 

method  in  [100] 

95.08  % 

method  in  [18] 

93.5  % 

method  in  [63] 

95.38  % 

method  in  [14] 

91.31% 

method  in  [12] 

95.86  % 

5.4  Results 

MNIST  Data  Set 

The  MNIST  data  set  results  are  shown  in  Table  13.  We  test  all  four  Laplacians  shown  in  the 
previous  section.  We  see  that  all  but  the  replicator  achieve  accuracy  in  the  98th  percentile.  Using 
the  replicator  worsens  the  accuracy  to  94th  percentile.  To  compare  to  some  state-of-the-art  work 
involving  some  kind  of  a  cut  (i.e.  Cheeger  cut,  balanced  cut,  etc.)  or  spectral  computations,  we 
note  the  result  of  98.36%  of  Bresson  et  al.  in  [12].  We  see  our  method  achieves  a  result  that  is  very 
similar. 
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(a)  ground  truth 


(b)  symmetric  graph  Laplacian 


(c)  scaled  graph  Laplacian 


(d)  replicator 


(e)  unbiased  adjacency  matrix 


Figure  27:  Modified  Cheeger  Method  MNIST  Data  Set  Results 


Two  Moons  Data  Set 

The  two  moons  data  set  results  are  shown  in  Table  13.  We  test  all  four  Laplacians  shown  in  the 
previous  section.  We  see  that  all  but  the  replicator  achieve  accuracy  in  the  97th  percentile.  Using 
the  replicator  worsens  the  accuracy  to  94th  percentile.  To  compare  to  some  state-of-the-art  work 
involving  some  kind  of  a  cut  (i.e.  Cheeger  cut,  balanced  cut,  etc.)  or  spectral  computations,  we 
note  the  95.08%  result  of  Szlam  et  al  in  [100],  the  93.5%  result  of  Buhler  et  al  in  [18],  the  95.38% 
result  of  Hein  et  al  in  [63],  the  91.31%  result  of  Bresson  et  al  in  [14]  and  the  95.86%  result  of 
Bresson  et  al  in  [12].  We  see  that  our  method  achieves  an  accuracy  that  is  at  least  2%  higher. 
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(c)  replicator 


(d)  unbiased  adjacency  matrix 


Figure  28:  Modified  Cheeger  Method  Two  Moons  Data  Set  Results 


Summary 


We  have  introduced  several  methods  using  a  graphical  setting.  The  advantage  of  using  a  graphi¬ 
cal  setting  is  that  it  is  a  more  general  framework  which  also  allows  one  to  use  nonlocal  information. 
In  images,  it  allows  one  to  capture  texture  and  repetitive  structure. 

First,  an  efficient  semi-supervised  algorithm  (MBO  method)  for  classification  and  image  pro¬ 
cessing  was  described.  The  main  advantage  of  the  method  lies  in  its  efficiency  and  simplicity, 
since  it  consists  of  just  alternating  between  solving  a  modified  heat  equation  and  thresholding.  The 
multiclass  extension  of  the  algorithm  was  presented  using  the  Gibbs  simplex.  The  method  can 
be  applied  to  many  areas,  and  an  application  to  hyperspectral  imagery  was  also  described.  Only 
around  2%-5%  of  the  data  is  needed  to  be  known  to  obtain  an  accurate  answer. 

One  drawback  of  the  MBO  algorithm  is  that  it  is  formulated  from  a  non-convex  optimization 
problem,  so  it  may  be  the  case  that  the  result  is  a  local  minimum.  To  overcome  this  problem, 
we  present  two  versions  of  solving  a  convex  optimization  problem.  The  first  one  is  an  equiva¬ 
lent  maximum-flow  problem,  which  is  solved  using  the  augmented  Lagrangian  method.  We  also 
discuss  some  variations  of  the  version,  such  as  solving  it  using  balancing  constraints  and  hard  su¬ 
pervised  constraints.  The  second  version  is  solving  the  problem  in  its  original  form  using  an  extra 
variable  and  then  proceeding  with  the  augmented  Lagrangian  method.  This  results  in  two  efficient 
algorithms  that  can  be  applied  to  the  problem  of  binary  classification/segmentation.  The  extension 
to  the  multiclass  class  is  a  problem  of  future  research.  Just  like  the  MBO  method,  these  versions 
are  semi- supervised,  although  only  a  very  small  percentage  of  the  data  is  needed  to  be  known  to 
achieve  accurate  results. 

We  also  present  a  modified  Cheeger  method  for  the  problem  of  binary  segmentation/classfication. 
Similarly  to  the  MBO  method,  the  root  of  this  algorithm  also  involves  the  Ginzburg-Landau  func¬ 
tional.  The  advantage  of  this  method  is  that  it  is  unsupervised,  and  thus  requires  no  prior  knowl¬ 
edge,  unlike  the  previous  mentioned  methods. 
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Appendix  A 


Our  methods  involve  the  computation  of  eigenvalues  and  associated  eigenvectors  of  the  symmetric 
graph  Laplacian.  In  practice,  one  computes  only  a  fraction  of  the  eigenvalues  and  eigenvectors, 
and  different  methods  of  doing  so  are  used  depending  on  the  size  of  the  domain. 

When  the  graph  is  sparse  and  is  of  moderate  size,  around  5000  x  5000  or  less,  we  use  a  Rayleigh- 
Chebyshev  procedure  outlined  in  [1].  It  is  a  modification  of  an  inverse  subspace  iteration  method 
that  uses  adaptively  determined  Chebyshev  polynomials.  The  procedure  is  also  a  robust  method 
that  converges  rapidly  and  that  can  handle  cases  when  there  are  eigenvalues  of  multiplicity  greater 
than  one. 

When  the  graph  is  very  large,  such  as  in  the  case  of  image  segmentation,  the  Nystrom  extension 
method  is  used. 


Nystrom  extension  for  fully  connected  graphs 

Nystrom  extension  [44,45]  is  a  matrix  completion  method  often  used  in  many  image  processing 
applications,  such  as  kernel  principle  component  analysis  [34]  and  spectral  clustering  [87].  This 
procedure  performs  much  faster  than  many  alternate  techniques  because  it  uses  approximations 
based  on  calculations  on  small  submatrices  of  the  original  large  matrix.  When  the  size  of  the 
matrix  becomes  very  large,  this  method  is  especially  valuable. 

Note  that  if  A  is  an  eigenvalue  of  W  —  D~^W D~^ ,  then  1  —  A  is  an  eigenvalue  of  Ls,  and  the 
two  matrices  have  the  same  eigenvectors.  We  formulate  a  method  to  calculate  the  eigenvectors  and 
eigenvalues  of  W  and  thus  of  Ls. 

Let  w  be  the  similarity  function,  A  be  an  eigenvalue  of  W,  and  0  its  associated  eigenvector.  The 
Nystrom  method  approximates  the  eigenvalue  equation 

/  w(y,  x)(f>{x)dx  = 

■hi 

using  a  quadrature  rule,  a  technique  to  find  weights  Cj(y )  and  a  set  of  L  interpolation  points  X  = 
{xj}  such  that 
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y \cj{y)<l>(xj)=  /  w{y,x)<j>(x)dx  +  E{y), 
i=i  ^ 

where  L(y)  represents  the  error  in  the  approximation. 

We  use  Cj(y )  =  i/;(y,  Xj)  and  choose  the  L  interpolation  points  randomly  from  the  vertex  set  V. 
Denote  the  set  of  L  randomly  chosen  points  by  X  =  {xi}f=1  and  its  complement  by  Y.  Partioning 
Z  into  Z  =  A"  U  Y  and  letting  4>k(x)  be  the  the  klh  eigenvector  of  if'  and  Xk  its  associated 
eigenvalue,  we  obtain  the  system  of  equations 


y  w(yi,  Xj^^Xj)  =  A k<t>k{yi)  Vyi  G  Y,  X/k  G  1, L. 

Xj£X 

This  system  of  equations  cannot  be  solved  directly  since  the  eigenvectors  are  not  known.  To 
overcome  this  problem,  the  L  eigenvectors  of  W  are  approximated  using  calculations  involving 
submatrices  of  W. 

Let  WXy  be  defined  as 

w(x1,y1)  ...  w(xi,yN-L ) 

w(xL,yi)  ...  w(xL,yN-L ) 

where  W  has  dimension  N  x  N.  The  matrices  f'f'Vx ,  Wxx  and  f'f'Vy  can  be  defined  similarly. 
Notice  that  Wxy-  Wyx  '  -  Then  the  matrix  if'  can  be  written  as 

Wxx  Wxy 
WYx  WVy 

To  calculate  the  eigenvalues  and  eigenvalues  of  W,  one  must  correctly  normalize  the  above 
weight  matrix.  The  correct  normalization  is  achieved  by  the  following  calculations,  where  we 
denote  by  1K  the  K -dimensional  unit  vector. 

Let  the  matrices  dx  and  dY  be  defined  as 

dx  =  Wxx'i-L  +  Wxy^n-l, 
dY  =  f'f'Vx  1  l  +  ( f'f'Vx  WXx  Wxy  )  1  .v-  l  • 
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If  A./B  denotes  componentwise  division  between  matrices  A  and  B,  and  vT  denotes  the  trans¬ 
pose  of  vector  v,  then  define  the  matrices  Wxx  and  WXy  as 

Wxx  =  Wxx-/(sxs§[), 

WxY  =  WXY-/ (sXs\), 

where  sx  =  \fdx  and  sy  =  yfdy. 

It  is  shown  in  [9]  that  if  Wxx  =  BxDB^,  and  if  A  and  Y  are  matrices  such  that 

=  Wxx  +  wxiwXYWyXwxi 

then  the  eigenvector  matrix  V  consisting  of  L  eigenvectors  of  W  and  thus  of  Ls  is  given  by 

BxD'iB^AT-'i 

WyXBxD-^B^AT-h 

while  /  —  T  contains  the  corresponding  eigenvalues  of  Ls  in  its  diagonal  entries. 

Therefore,  the  efficiency  of  the  Nystrom  extension  method  lies  with  the  fact  that  when  com¬ 
puting  the  eigenvalues  and  eigenvectors  of  an  N  x  N  matrix,  where  N  is  large,  it  approximates 
them  using  calculations  involving  only  much  smaller  matrices,  the  largest  of  which  has  dimension 
N  x  L,  where  L  is  small. 

Although  this  method  is  very  efficient,  there  are  problems  when  it  is  applied  to  binary  image 
inpainting,  especially  when  the  image  has  a  repetitive  structure.  This  occurs  because  of  singular  or 
nearly  singular  matrices  that  arise  in  the  calculations  of  the  Nystrom  extension  method.  Therefore, 
in  this  case,  we  use  the  Rayleigh-Chebyshev  procedure  of  [1]  to  calculate  the  eigenvalues  and 
associated  eigenvectors. 
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